One of the features I rely on most for content creation is extracting metadata from web pages. ColdFusion, together with the jSoup ColdBox module, makes this simple.
Getting started with jSoup is straightforward if you use CommandBox to manage your development server. Install the module with the following command:
box install cbjsoup
For more details, you can refer to this link: GitHub Gist.
I use the jSoup `connect()` method to fetch the document, although you could also retrieve the HTML with cfhttp and hand it to jSoup for parsing. The code is configured to follow redirects and to set a custom user agent, since some sites filter requests that do not look like they come from a browser. Even so, I have seen certain web application firewalls reject the request. One possible cause is that they block spider-like traffic originating from my VPS provider, DigitalOcean; the same request works without problems when run locally.
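As a rough illustration of the connection step, here is a minimal sketch. It assumes the module exposes the jsoup class under the WireBox ID `jsoup@cbjsoup`; that ID, the `fetchDocument` name, the user agent string, and the timeout are my assumptions for illustration and are not taken from the Gist.

```cfml
component {

    // Assumption: cbjsoup registers org.jsoup.Jsoup under this WireBox ID.
    property name="jsoup" inject="jsoup@cbjsoup";

    /**
     * Hypothetical helper: fetch a page and return the parsed jsoup Document.
     */
    function fetchDocument( required string targetUrl ) {
        // Follows redirects, sends an illustrative browser-like user agent,
        // and gives up after 10 seconds (the timeout is in milliseconds).
        return variables.jsoup
            .connect( arguments.targetUrl )
            .followRedirects( javaCast( "boolean", true ) )
            .userAgent( "Mozilla/5.0 (compatible; MetadataFetcher/1.0)" )
            .timeout( javaCast( "int", 10000 ) )
            .get();
    }

}
```

`get()` throws if the server answers with an error status, so a caller dealing with firewall rejections would probably want to wrap the call in a try/catch.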
The next step is to iterate over the page's meta tags and pull out the Open Graph and Twitter metadata. The extracted data is then returned to the caller for further processing.
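The extraction step might look something like the sketch below. The `extractMetadata` name and the choice of a struct keyed by tag name are hypothetical; the function is meant to live in a component and only assumes it receives a jsoup Document.

```cfml
/**
 * Hypothetical helper: collect Open Graph and Twitter meta tags from a
 * jsoup Document into a struct keyed by tag name (for example "og:title").
 */
function extractMetadata( required any doc ) {
    var metadata = {};
    // Open Graph tags use the "property" attribute; Twitter tags usually use "name".
    var tags = arguments.doc.select( "meta[property^=og:], meta[name^=twitter:], meta[property^=twitter:]" );
    var it = tags.iterator();

    while ( it.hasNext() ) {
        var tag = it.next();
        // attr() returns an empty string when the attribute is missing.
        var key = len( tag.attr( "property" ) ) ? tag.attr( "property" ) : tag.attr( "name" );
        var value = tag.attr( "content" );

        if ( len( key ) && len( value ) ) {
            metadata[ key ] = value;
        }
    }

    return metadata;
}
```

When a tag is repeated (several `og:image` entries, say), this version keeps only the last value; collecting repeated keys into an array would be a reasonable variation.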
While this example is tailored to ColdBox, it can be readily adapted for use in non-ColdBox applications.
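Outside ColdBox, one option is to load the jsoup class directly through Java interop. The sketch below assumes the jsoup JAR is already on the classpath (for example through `this.javaSettings.loadPaths` in Application.cfc), and the URL is only a placeholder.

```cfml
<cfscript>
// Assumption: the jsoup JAR is already on the classpath.
jsoup = createObject( "java", "org.jsoup.Jsoup" );

// Illustrative URL; replace it with the page you want to inspect.
doc = jsoup.connect( "https://example.com/" ).get();

// Print the page title, escaped for safe HTML output.
writeOutput( encodeForHTML( doc.title() ) );
</cfscript>
```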