Building a better spider

Robert evolved a web-scraping service over several years to extract link metadata (Open Graph, Twitter Cards, oEmbed, JSON-LD) for social sharing features, progressively hardening it against bot detection, timeouts, and SSL errors. The latest version uses a multi-layered extraction strategy: rotating user-agents, disabled certificate validation, and browser headers sent only with non-bot UAs, all aimed at maximizing successful extraction.

When I was working on DismalThreads, I needed a way to extract metadata for a web page when a link was shared. It just so happens that the very first post on this blog, Using ColdFusion to read OpenGraph and Twitter Metadata, outlined code I was using for another project. I don't even remember what that project was, but 2020 was quite an eventful year.

The code I posted in 2020 was simple, but it worked.

component {

	// SpiderService.cfc
	property name="jSoup" inject="javaLoader:org.jsoup.Jsoup";

	function spider( required string linkUrl ){
		var meta = {};
		cfhttp( url = linkUrl );
		var jsDoc    = jSoup.parse( cfhttp.fileContent );
		var el       = jsDoc.select( "meta" );
		var filtered = el.filter( function( its ){
			return its.attr( "name" ).find( "twitter:" ) ||
			its.attr( "name" ).find( "og:" ) ||
			its.attr( "property" ).find( "twitter:" ) ||
			its.attr( "property" ).find( "og:" );
		} );
		filtered.each( function( i ){
			len( i.attr( "name" ) ) ? meta[ i.attr( "name" ) ] = i.attr( "content" ) : meta[ i.attr( "property" ) ] = i.attr( "content" );
		} );
		return meta;
	}

}

When I later started working on renegade-forums, I expanded on the existing code to make it more robust.

This improved version uses jsoup's built-in connect() method instead of cfhttp(), automatically follows redirects, overrides the user agent to avoid bot-detection blocks, and searches for fallback images when a page has no image metadata.

component {

	property name="jSoup" inject="javaLoader:org.jsoup.Jsoup";

	function spider( required string link ){
		var meta = { "url" : arguments.link, "alt_images" : [] };

		try {
			var jsDoc = jSoup
				.connect( link )
				.followRedirects( true )
				.userAgent( "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0" )
				.get();
			var el       = jsDoc.select( "meta" );
			var filtered = el.filter( ( item ) => {
				return item.attr( "name" ).find( "twitter:" ) ||
				item.attr( "name" ).find( "og:" ) ||
				item.attr( "property" ).find( "twitter:" ) ||
				item.attr( "property" ).find( "og:" );
			} );
			filtered.each( function( i ){
				len( i.attr( "name" ) ) ? meta[ i.attr( "name" ) ] = i.attr( "content" ) : meta[ i.attr( "property" ) ] = i.attr( "content" );
			} );

			if ( !meta.keyExists( "image" ) ) {
				el = jsDoc.select( "img" );

				el.each( function( item ){
					if ( item.attributes().hasKey( "src" ) ) {
						if (
							item.attr( "src" ).findNoCase( ".jpg" ) || item.attr( "src" ).findNoCase( ".jpeg" ) || item
								.attr( "src" )
								.findNoCase( ".gif" )
						) {
							meta.alt_images.append( item.attr( "src" ) );
						}
					} else if ( item.attributes().hasKey( "data-img-url" ) ) {
						if (
							item.attr( "data-img-url" ).findNoCase( ".jpg" ) || item
								.attr( "data-img-url" )
								.findNoCase( ".jpeg" ) || item.attr( "data-img-url" ).findNoCase( ".gif" )
						) {
							meta.alt_images.append( item.attr( "data-img-url" ) );
						}
					}
				} );
			}
		} catch ( any e ) {
			// todo: log exception
		}

		return meta;
	}

	function extract_text( required string html ){
		var jsDoc = jSoup.parse( html );
		var ps    = jsDoc.select( "p" );

		var t = [];
		ps.each( function( el ){
			t.append( "<p>" & el.text() & "</p>" );
		} );

		var res = arrayToList( t, "" );
		return res;
	}

}

As I developed DismalThreads, the limitations of this approach became increasingly apparent. Many of the sites I attempted to scrape failed immediately due to bot-detection systems. The hardcoded one-second timeout was too aggressive and caused failures on slower sites. On top of that, many modern sites sit behind Cloudflare, which adds yet another layer of difficulty to web scraping.

After several iterations, here is the new hardened Spider service.

/**
 * SpiderService
 *
 * Singleton service that fetches rich metadata from external URLs.
 * Used by the post editor to auto-populate the title field and preview card when a user
 * pastes a link.
 *
 * Extraction strategy (each layer only fills keys not already populated):
 *  1. OG / Twitter <meta> tags
 *  2. oEmbed JSON endpoint (discovered via <link type="application/json+oembed">)
 *  3. JSON-LD <script type="application/ld+json"> (Schema.org structured data)
 *  4. <meta name="description"> fallback for og:description
 *  5. <link rel="image_src"> fallback for og:image
 *  6. <title> tag as last-resort title
 *  7. <img> tag scan when no og:image was found anywhere above
 *
 * Depends on jsoup (Java HTML parser) loaded via cbJavaLoader.
 */
component singleton {

	property name="jSoup"  inject="javaloader:org.jsoup.Jsoup";
	property name="logger" inject="logbox:logger:{this}";
	/**
	 * Fetches Open Graph and Twitter Card metadata from the given URL.
	 *
	 * Strategy:
	 *  1. Connect via jsoup with an 8 s timeout and realistic browser headers to avoid bot blocks.
	 *     Uses a randomly selected user-agent from a pool to reduce fingerprinting.
	 *     TLS certificate validation is disabled so sites with expired/self-signed certs still work.
	 *     Body size cap is removed to prevent truncation of large pages.
	 *  2. Filter all <meta> tags whose name or property starts with "og:" or "twitter:" into a
	 *     flat struct. Each subsequent step only fills keys not already populated.
	 *     Relative og:image URLs are resolved to absolute using the request origin.
	 *  3. oEmbed: if a <link type="application/json+oembed"> is present, fetch its href and
	 *     merge title / thumbnail_url / description. Also stores oembed:author_name,
	 *     oembed:provider_name, oembed:html, oembed:type as additive keys.
	 *  4. JSON-LD: if a <script type="application/ld+json"> is present, parse its Schema.org
	 *     data and fill any still-missing title, description, or image.
	 *  5. Standard <meta name="description"> fallback for og:description.
	 *  6. <link rel="image_src"> fallback for og:image before the img scan.
	 *  7. <title> fallback: use jsDoc.title() if no title key has been populated yet.
	 *  8. If still no og:image, fall back to scanning <img> tags (absUrl for absolute paths,
	 *     plus data-src / data-lazy-src / data-original for lazy-loaded images),
	 *     accepting .jpg/.jpeg/.gif/.png/.webp into meta.alt_images[].
	 *  9. On any fetch or parse error, log a warning and return an empty struct so the
	 *     caller can degrade gracefully.
	 * 10. If the initial fetch yields an empty struct (4xx status or exception), retry once
	 *     using the facebookexternalhit/1.1 UA with a 10 s timeout, carrying any cookies set
	 *     by the initial response. Only minimal og: tag extraction is attempted on retry.
	 *
	 * @link The fully-qualified URL to fetch (must include protocol).
	 * @return Struct of metadata keys. Common keys: og:title, og:description, og:image,
	 *         twitter:title, twitter:description, oembed:author_name, oembed:provider_name,
	 *         oembed:html, oembed:type, url, alt_images[]. Empty struct on failure.
	 */
	function spider( required string link ){
		var meta      = {};
		var cookieMap = {}; // cookies from initial response, carried to retry if needed

		// Rotate through a pool of user-agent strings to reduce bot-blocking.
		// The Chrome UA is a full browser persona; the others are known social-crawler bots
		// that many sites whitelist explicitly.
		var chromeUA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";
		var userAgents = [
			chromeUA,
			"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
			"Twitterbot/1.0",
			"LinkedInBot/1.0 (compatible; Mozilla/5.0; Apache-HttpClient +http://www.linkedin.com)",
			"Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)"
		];
		var chosenUA = userAgents[ randRange( 1, userAgents.len() ) ];

		// Derive scheme+host for resolving relative og:image URLs later
		var linkURI    = createObject( "java", "java.net.URI" ).init( arguments.link );
		var linkOrigin = linkURI.getScheme() & "://" & linkURI.getHost();

		try {
			// Use execute() so we can inspect the HTTP status before parsing.
			// maxBodySize(0) removes the default 1 MB cap that can truncate large pages.
			// validateTLSCertificates(false) handles sites with expired/self-signed SSL certs.
			var conn = jSoup
				.connect( link )
				.timeout( 8000 )
				.followRedirects( true )
				.ignoreHttpErrors( true )
				.ignoreContentType( true )
				.maxBodySize( 0 )
				.validateTLSCertificates( false ) // removed in jsoup 1.12.1+; newer versions need a custom sslSocketFactory()
				.userAgent( chosenUA )
				.header( "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8" )
				.header( "Accept-Language", "en-US,en;q=0.9" )
				.header( "Cache-Control", "max-age=0" );

			// Browser-navigation headers are only appropriate for the Chrome UA.
			// Sending Referer / Sec-Fetch-* alongside a bot UA is a fingerprinting red flag.
			if ( chosenUA == chromeUA ) {
				conn = conn
					.header( "Referer", "https://www.google.com/" )
					.header( "Upgrade-Insecure-Requests", "1" )
					.header( "Sec-Fetch-Dest", "document" )
					.header( "Sec-Fetch-Mode", "navigate" )
					.header( "Sec-Fetch-Site", "cross-site" )
					.header( "Sec-Fetch-User", "?1" );
			}

			var response = conn.execute();

			// Capture any cookies set by the server — carry them on the retry if needed.
			// Some bot-detection systems gate on session cookies established in the first response.
			cookieMap = response.cookies();

			var statusCode = response.statusCode();
			if ( statusCode >= 400 ) {
				// Bad status — log and fall through so retry logic can run
				logger.warn( "SpiderService received HTTP #statusCode# for #arguments.link# — skipping parse" );
			} else {
				var jsDoc = response.parse();

				// Collect og:* and twitter:* meta tags into a flat struct
				var el       = jsDoc.select( "meta" );
				var filtered = el.filter( ( item ) => {
					var n = item.attr( "name" );
					var p = item.attr( "property" );
					return left( n, 8 ) == "twitter:" ||
						left( n, 3 ) == "og:" ||
						left( p, 8 ) == "twitter:" ||
						left( p, 3 ) == "og:";
				} );

				filtered.each( ( i ) => {
					if ( len( i.attr( "name" ) ) ) {
						meta[ i.attr( "name" ) ] = i.attr( "content" );
					} else {
						meta[ i.attr( "property" ) ] = i.attr( "content" );
					}
				} );

				meta.append( { "url" : arguments.link, "alt_images" : [] } );

				// Resolve relative og:image URLs to absolute (e.g. "/images/cover.jpg" → "https://example.com/images/cover.jpg")
				if ( meta.keyExists( "og:image" ) && meta[ "og:image" ].left( 4 ).lCase() != "http" ) {
					if ( left( meta[ "og:image" ], 1 ) == "/" )
						meta[ "og:image" ] = linkOrigin & meta[ "og:image" ];
				}

				// oEmbed enrichment: discover endpoint via <link type="application/json+oembed">
				var oembedLink = jsDoc.select( "link[type=application/json+oembed]" );
				if ( !oembedLink.isEmpty() ) {
					try {
						var oembedUrl  = oembedLink.first().attr( "href" );
						var oembedJson = jSoup
							.connect( oembedUrl )
							.timeout( 5000 )
							.ignoreContentType( true )
							.ignoreHttpErrors( true )
							.execute()
							.body();
						var oembedData = deserializeJSON( oembedJson );

						if ( isStruct( oembedData ) ) {
							if ( !meta.keyExists( "og:title" ) && !meta.keyExists( "twitter:title" ) && oembedData.keyExists( "title" ) )
								meta[ "og:title" ] = oembedData.title;
							if ( !meta.keyExists( "og:image" ) && oembedData.keyExists( "thumbnail_url" ) )
								meta[ "og:image" ] = oembedData.thumbnail_url;
							if ( !meta.keyExists( "og:description" ) && oembedData.keyExists( "description" ) )
								meta[ "og:description" ] = oembedData.description;
							if ( oembedData.keyExists( "author_name" ) )   meta[ "oembed:author_name" ]   = oembedData.author_name;
							if ( oembedData.keyExists( "provider_name" ) ) meta[ "oembed:provider_name" ] = oembedData.provider_name;
							if ( oembedData.keyExists( "html" ) )          meta[ "oembed:html" ]           = oembedData.html;
							if ( oembedData.keyExists( "type" ) )          meta[ "oembed:type" ]           = oembedData.type;
						}
					} catch ( any e ) {
						logger.warn( "SpiderService oEmbed fetch failed for #arguments.link#: #e.message#" );
					}
				}

				// JSON-LD enrichment: parse first <script type="application/ld+json">
				var jsonLdScripts = jsDoc.select( "script[type=application/ld+json]" );
				if ( !jsonLdScripts.isEmpty() ) {
					try {
						var jsonLdData = deserializeJSON( jsonLdScripts.first().html() );

						// Normalise: unwrap array or @graph
						var schema = isArray( jsonLdData ) ? jsonLdData[ 1 ] : jsonLdData;
						if ( isStruct( schema ) && schema.keyExists( "@graph" ) && isArray( schema[ "@graph" ] ) )
							schema = schema[ "@graph" ][ 1 ];

						if ( isStruct( schema ) ) {
							if ( !meta.keyExists( "og:title" ) && !meta.keyExists( "twitter:title" ) ) {
								if ( schema.keyExists( "headline" ) )       meta[ "og:title" ] = schema.headline;
								else if ( schema.keyExists( "name" ) )      meta[ "og:title" ] = schema.name;
							}
							if ( !meta.keyExists( "og:description" ) && schema.keyExists( "description" ) )
								meta[ "og:description" ] = schema.description;
							if ( !meta.keyExists( "og:image" ) && schema.keyExists( "image" ) ) {
								var schemaImg = schema.image;
								if ( isStruct( schemaImg ) && schemaImg.keyExists( "url" ) )
									meta[ "og:image" ] = schemaImg.url;
								else if ( isSimpleValue( schemaImg ) )
									meta[ "og:image" ] = schemaImg;
							}
						}
					} catch ( any e ) {
						logger.warn( "SpiderService JSON-LD parse failed for #arguments.link#: #e.message#" );
					}
				}

				// Standard <meta name="description"> fallback for og:description
				if ( !meta.keyExists( "og:description" ) ) {
					var descTag = jsDoc.select( "meta[name=description]" );
					if ( !descTag.isEmpty() ) {
						var descContent = descTag.first().attr( "content" );
						if ( len( trim( descContent ) ) ) meta[ "og:description" ] = descContent;
					}
				}

				// <link rel="image_src"> fallback for og:image (used by some older/news sites)
				if ( !meta.keyExists( "og:image" ) ) {
					var imgSrcLink = jsDoc.select( "link[rel=image_src]" );
					if ( !imgSrcLink.isEmpty() ) {
						var imgSrcHref = imgSrcLink.first().attr( "abs:href" );
						if ( len( imgSrcHref ) ) meta[ "og:image" ] = imgSrcHref;
					}
				}

				// <title> fallback when no title was found by any strategy above
				if ( !meta.keyExists( "og:title" ) && !meta.keyExists( "twitter:title" ) ) {
					var pageTitle = jsDoc.title();
					if ( len( trim( pageTitle ) ) ) meta[ "og:title" ] = pageTitle;
				}

				// Fallback: scrape <img> tags when no og:image meta tag was found.
				// Checks src (via absUrl), plus data-src / data-lazy-src / data-original
				// for lazy-loaded images, and the legacy data-img-url attribute.
				if ( !meta.keyExists( "og:image" ) ) {
					el = jsDoc.select( "img" );

					el.each( function( item ){
						var imgSrc = item.absUrl( "src" );

						// Fall back through lazy-load attributes when src is empty
						if ( !len( imgSrc ) ) imgSrc = item.absUrl( "data-src" );
						if ( !len( imgSrc ) ) imgSrc = item.absUrl( "data-lazy-src" );
						if ( !len( imgSrc ) ) imgSrc = item.absUrl( "data-original" );

						if ( len( imgSrc ) ) {
							if (
								imgSrc.findNoCase( ".jpg" ) || imgSrc.findNoCase( ".jpeg" ) ||
								imgSrc.findNoCase( ".gif" ) || imgSrc.findNoCase( ".png" ) ||
								imgSrc.findNoCase( ".webp" )
							) {
								meta.alt_images.append( imgSrc );
							}
						} else if ( item.attributes().hasKey( "data-img-url" ) ) {
							var dataImgUrl = item.attr( "data-img-url" );
							if (
								dataImgUrl.findNoCase( ".jpg" ) || dataImgUrl.findNoCase( ".jpeg" ) ||
								dataImgUrl.findNoCase( ".gif" ) || dataImgUrl.findNoCase( ".png" ) ||
								dataImgUrl.findNoCase( ".webp" )
							) {
								meta.alt_images.append( dataImgUrl );
							}
						}
					} );
				}
			} // end else (status < 400)
		} catch ( any e ) {
			logger.warn( "SpiderService failed to fetch metadata for #arguments.link# [#e.type#]: #e.message# — #e.detail#", e.stackTrace );
		}

		// Retry on total failure: bad HTTP status or caught exception both leave meta empty.
		// Use facebookexternalhit UA with a longer timeout and minimal og: extraction only.
		// Pass any cookies from the initial response — bot-detection systems often require them.
		if ( meta.isEmpty() ) {
			try {
				var retryConn = jSoup
					.connect( link )
					.timeout( 10000 )
					.followRedirects( true )
					.ignoreHttpErrors( true )
					.ignoreContentType( true )
					.maxBodySize( 0 )
					.userAgent( "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)" );

				if ( !cookieMap.isEmpty() ) retryConn = retryConn.cookies( cookieMap );

				var retryResponse = retryConn.execute();

				if ( retryResponse.statusCode() < 400 ) {
					var retryDoc = retryResponse.parse();
					retryDoc.select( "meta" ).filter( ( item ) => {
						var p = item.attr( "property" );
						var n = item.attr( "name" );
						return left( p, 3 ) == "og:" || left( n, 3 ) == "og:";
					} ).each( ( i ) => {
						if ( len( i.attr( "name" ) ) )
							meta[ i.attr( "name" ) ] = i.attr( "content" );
						else
							meta[ i.attr( "property" ) ] = i.attr( "content" );
					} );
					if ( !meta.isEmpty() ) meta[ "url" ] = arguments.link;
				}
			} catch ( any e ) {
				logger.warn( "SpiderService retry failed for #arguments.link# [#e.type#]: #e.message# — #e.detail#" );
			}
		}

		return meta;
	}

	/**
	 * Strips all HTML tags from an HTML string and returns plain-text paragraphs.
	 *
	 * Parses the HTML with jsoup, selects all <p> elements, and rebuilds them as
	 * plain-text <p> tags (no attributes or child HTML). Useful for generating safe
	 * text previews from raw HTML content.
	 *
	 * @html The raw HTML string to process.
	 * @return A concatenated string of plain-text <p>…</p> blocks with no wrapper element.
	 */
	function extract_text( required string html ){
		var jsDoc = jSoup.parse( html );
		var ps    = jsDoc.select( "p" );

		var t = [];
		ps.each( function( el ){
			t.append( "<p>" & el.text() & "</p>" );
		} );

		var res = arrayToList( t, "" );
		return res;
	}

}
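To see what extract_text() produces, here is a quick sketch (hypothetical input, with spiderService standing in for an injected instance of this component):

```cfml
// Markup, entities, and non-<p> content are all stripped; only paragraph text survives.
var html = '<div><p>Hello <strong>world</strong></p><p>Second &amp; last</p><script>x()</script></div>';
spiderService.extract_text( html );
// → "<p>Hello world</p><p>Second & last</p>"
```

Note that jsoup decodes HTML entities when text() is called, so `&amp;` comes back as a plain `&`.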

The Hardened spider() Method in Depth

Setup Phase

Before making any network requests, the method prepares two key items. First, it randomly selects a user-agent from a pool (Chrome, facebookexternalhit, Twitterbot, LinkedInBot, or Slackbot) to reduce the chance of being flagged as a scraper. Second, it parses the URL with java.net.URI to extract the scheme and host (for example, https://example.com), used later to resolve relative image URLs.
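As a worked example of the origin derivation (same java.net.URI approach as the service, with a hypothetical URL):

```cfml
// Path and query string are discarded; only scheme + host survive.
var linkURI    = createObject( "java", "java.net.URI" ).init( "https://example.com/articles/42?utm=x" );
var linkOrigin = linkURI.getScheme() & "://" & linkURI.getHost();
// linkOrigin → "https://example.com"
// linkOrigin & "/images/cover.jpg" → "https://example.com/images/cover.jpg"
```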

Primary Fetch

It connects via jsoup with an 8-second timeout and several important configuration options:

  • ignoreHttpErrors(true) — captures 4xx/5xx responses instead of throwing exceptions
  • ignoreContentType(true) — fetches pages even if the server returns an unusual content type
  • maxBodySize(0) — removes jsoup's default 1 MB cap, preventing large pages from being truncated
  • TLS errors are silently handled, allowing pages with expired or self-signed certificates to be fetched

If the Chrome user-agent was chosen, the method adds browser-navigation headers (Sec-Fetch-*, Referer, etc.). Critically, it omits these headers when using bot user-agents: sending Sec-Fetch-Dest: document alongside Twitterbot/1.0 would be an obvious fingerprinting red flag to bot-detection systems.

Cookies from the initial response are always captured into cookieMap because some bot-detection systems (such as Cloudflare) expect session cookies to persist across requests.

Metadata Extraction — Layered, Non-Destructive

Each extraction layer only fills keys that are not already populated. This "waterfall" pattern ensures that earlier sources take priority:
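The fill-only-if-missing rule could be factored into a small helper. This is a hypothetical sketch, not part of the service above, but it captures the contract every layer follows:

```cfml
// Hypothetical helper illustrating the non-destructive "waterfall" fill:
// a layer may only populate a key that no earlier layer has already set.
function fillIfMissing( required struct meta, required string key, string value = "" ){
	if ( !meta.keyExists( key ) && len( value ) ) {
		meta[ key ] = value;
	}
}

var meta = {};
fillIfMissing( meta, "og:title", "From the meta tag" ); // layer 1 wins
fillIfMissing( meta, "og:title", "From JSON-LD" );      // no-op: key already set
// meta[ "og:title" ] → "From the meta tag"
```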

Layer 1 — OG/Twitter Meta Tags
Filters all elements whose name or property attributes start with og: or twitter: and stores them in a flat struct. This is the primary source for most modern websites.
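To make this layer concrete, here is a minimal sketch with hypothetical markup, using the same injected jSoup object:

```cfml
// Hypothetical page head for illustration.
var html = '<html><head>
	<meta property="og:title" content="Example Article" />
	<meta property="og:image" content="https://example.com/cover.jpg" />
	<meta name="twitter:card" content="summary_large_image" />
</head><body></body></html>';

var jsDoc = jSoup.parse( html );
// After the og:/twitter: prefix filter runs, the struct holds:
// { "og:title": "Example Article",
//   "og:image": "https://example.com/cover.jpg",
//   "twitter:card": "summary_large_image" }
```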

Layer 2 — oEmbed
Looks for a <link type="application/json+oembed"> tag, fetches that JSON endpoint, and merges in the title, thumbnail URL, and description. It also adds oembed:-prefixed keys for author name, provider name, embed HTML, and type—these are additive and do not compete with og: keys.
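A typical oEmbed response looks like this (an abridged, hypothetical payload, not from a real endpoint):

```json
{
	"type": "video",
	"title": "Example Video",
	"author_name": "Example Creator",
	"provider_name": "ExampleTube",
	"thumbnail_url": "https://example.com/thumb.jpg",
	"html": "<iframe src=\"https://example.com/embed/123\"></iframe>"
}
```

Here title, thumbnail_url, and description (when present) fill og: keys only if they are still missing, while author_name, provider_name, html, and type always land under their oembed:-prefixed names.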

Layer 3 — JSON-LD
Parses <script type="application/ld+json"> Schema.org structured data blocks. It handles three data shapes: a plain object, an array (takes the first element), or a @graph wrapper (takes the first graph node). It maps headline or name to title, description, and image (which can be a string or a {url: ...} object).
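Of the three shapes, the @graph wrapper is the least obvious. A hypothetical payload looks like this:

```json
{
	"@context": "https://schema.org",
	"@graph": [
		{
			"@type": "Article",
			"headline": "Example Headline",
			"description": "Example description text.",
			"image": { "@type": "ImageObject", "url": "https://example.com/cover.jpg" }
		},
		{ "@type": "WebPage", "name": "Example Page" }
	]
}
```

The code takes the first graph node, so here headline would fill og:title, description would fill og:description, and image.url would fill og:image, each only if still missing.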

Layers 4–6 — Simple Fallbacks

  • <meta name="description"> for the description
  • <link rel="image_src"> for the image (used by some older and news websites)
  • <title> as the last-resort title

Layer 7 — Image Tag Scan
Runs only if no og:image was found by any earlier method. It walks through every <img> tag, checking the src attribute first, then falling back to data-src, data-lazy-src, data-original, and data-img-url for lazy-loaded images. Only files ending in .jpg, .jpeg, .gif, .png, or .webp are added to alt_images[].
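The reason absUrl() matters here: it resolves relative paths against the document's base URI, which jsoup sets from the connection URL. A quick sketch using jsoup's two-argument parse() to simulate a fetched page:

```cfml
// Jsoup.parse( html, baseUri ) simulates a document fetched from a known URL.
var jsDoc = jSoup.parse( '<img src="/images/photo.jpg">', "https://example.com/articles/42" );
var img   = jsDoc.select( "img" ).first();
// img.attr( "src" )   → "/images/photo.jpg"
// img.absUrl( "src" ) → "https://example.com/images/photo.jpg"
```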

The spider() method is now as robust as I can make it, with fallback logic for most failure scenarios. If all layers fail, an empty meta object is returned, allowing the calling code to gracefully display an error or handle the missing metadata appropriately.