Tidying up HTML with jSoup: Part Deux
Robert shows how to implement tidy HTML using jSoup, again
commandbox cfml
May 1, 2024 / Robert Zehnder
After my last post, a couple of buddies said I did not really explain the solution and the post felt unfinished. After going back and reading it, I agree, so I will go into more detail here.
The output of commandbox-ssg has always been something that makes my OCD tingle. When build generates a site, templates are rendered in steps: first the template itself is rendered, then the view is rendered around that content, and finally the layout is applied. Because of how things are processed, the indentation is "chunky", and the rendering process also generates blank lines when processing the CFML templates.
Below is an example:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<title>KISDigital</title>
<meta name="description" content="ColdFusion, ColdBox, CommandBox and other assorted musings">
<meta name="author" content="Robert Zehnder">
<meta name="twitter:widgets:theme" content="light">
<meta name="twitter:widgets:border-color" content="#55acee">
<meta property="og:title" content="KISDigital" />
<meta name="twitter:title" content="KISDigital" />
<meta name="twitter:card" content="summary_large_image" />
<meta property="og:description" content="ColdFusion, ColdBox, CommandBox and other assorted musings" />
<meta name="twitter:description" content="ColdFusion, ColdBox, CommandBox and other assorted musings" />
<meta property="og:image" content="https://static.kisdigital.com/kisdigital-logo.jpg" />
<meta name="twitter:image" content="https://static.kisdigital.com/kisdigital-logo.jpg" />
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/font/bootstrap-icons.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism-themes/1.9.0/prism-one-dark.min.css" integrity="sha512-c6S8OdtvoqZCbMfA1lWE0qd368pLdFvVHVILQzNizfowC+zV8rmVKdSlmL5SuidvATO0A7awDg53axd+s/9amw==" crossorigin="anonymous" referrerpolicy="no-referrer" />
<link rel="stylesheet" href="/assets/css/site.css?v=1714584743059">
</head>
<body style="padding-top: 70px;">
<header>
<nav class="header">
<div class="container">
<div class="row">
<div class="col-2">
<a class="site-link" href="/"><span style="color: var(--post-link-text)">KIS</span>Digital</a>
</div>
<div class="col-10"></div>
</div>
</div>
</nav>
</header>
<div class="container text-white">
<div class="row">
<div class="col-lg-8 col-md-12">
<div class="card rf-card-bordered text-white">
<div class="card-body">
<div class="card-text">
<div class="row mb-5">
<div class="col-12 p-1">
... more html
The output is not bad, it just is not great.
I thought about a few different options to solve the issue. First, I thought about using an HTML post-processor to run after the build process. That would work, but it defeats the purpose of a ColdFusion static site generator if you have to drop back to JavaScript.
I was optimistic about Ben Nadel's dedentContent() method, which he blogged about for his cf_savefile custom tag. It is a very cool idea, but there are potential issues when dealing with preformatted content and code tags. I am sure they could be accounted for, but there are smarter ways to handle it.
Which brings me to the solution I used: jSoup. Using jSoup with ColdBox is as easy as box install cbjsoup with Don Bellamy's cbjsoup module. I have used this module extensively and it is definitely one of my favorite ColdBox modules.
Unfortunately, the module will not work for a CommandBox custom module, but it does provide the jsoup-1.14.3.jar, which is all that is really needed.
Getting this working with commandbox-ssg was a matter of dropping the jar into models/lib in the commandbox-ssg module directory. The build command needs a way to call jSoup's parse() method to process the ugly HTML and return clean, tidy HTML, so I created a simple service that only exposes the parse() method of the library.
// models/JSoup.cfc
component {
    LIB_PATHS = directoryList(
        getDirectoryFromPath( getCurrentTemplatePath() ) & "lib",
        false,
        "path"
    );
    JSoup function init(){
        variables.jsoup = createObject( "java", "org.jsoup.Jsoup", LIB_PATHS );
        return this;
    }
    function parse( html ){
        return variables.jsoup.parse( html );
    }
}
The build command needs to be able to reference the new service, so it is injected into the variables scope.
property name="jSoup" inject="JSoup@commandbox-ssg";
Inside the renderTemplate() method, after all HTML has been generated, the content is piped through jSoup's parse().
/**
 * returns rendered html for a template and view
 *
 * @prc request context for the current page
 */
function renderTemplate( prc ){
    var renderedHtml = "";
    var template = "";
    try {
        // template is CF markup
        if ( prc.inFile.findNoCase( ".cfm" ) ) {
            if ( process.hasIncludes && process.views.find( prc.view ) && prc.layout != "none" ) {
                // render the cfml in the template first
                template = fileSystemUtil.makePathRelative( prc.inFile );
                savecontent variable="prc.content" {
                    include template;
                }
                // overlay the view
                template = fileSystemUtil.makePathRelative( cwd & "_includes/" & prc.view & ".cfm" );
                savecontent variable="renderedHtml" {
                    include template;
                }
            } else {
                // view was not found, just render the template
                template = fileSystemUtil.makePathRelative( prc.inFile );
                savecontent variable="renderedHtml" {
                    include template;
                }
            }
        }
        // template is markdown
        if ( prc.inFile.findNoCase( ".md" ) ) {
            if ( process.hasIncludes && process.views.find( prc.view ) ) {
                template = fileSystemUtil.makePathRelative( cwd & "_includes/" & prc.view & ".cfm" );
                savecontent variable="renderedHtml" {
                    include template;
                }
            } else {
                renderedHtml = prc.content;
            }
        }
        // skip layout if "none" is specified
        if (
            prc.layout != "none" &&
            process.hasIncludes &&
            process.layouts.find( prc.layout )
        ) {
            template = fileSystemUtil.makePathRelative( cwd & "_includes/layouts/" & prc.layout & ".cfm" );
            savecontent variable="renderedHtml" {
                include template;
            }
        }
    } catch ( any e ) {
        error( prc.inFile & " :: " & e.message );
    }
    // a little whitespace management
    return trim( JSoup.parse( renderedHtml ) );
}
The final line is where all the magic happens. Calling jSoup's parse() on the raw output returns tidy HTML that is ready to be written to disk and served on your preferred static site host. If you were to view the source now, you would see the output no longer has blank lines and chunky indentation.
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<title>KISDigital</title>
<meta name="description" content="ColdFusion, ColdBox, CommandBox and other assorted musings">
<meta name="author" content="Robert Zehnder">
<meta name="twitter:widgets:theme" content="light">
<meta name="twitter:widgets:border-color" content="#55acee">
<meta property="og:title" content="KISDigital">
<meta name="twitter:title" content="KISDigital">
<meta name="twitter:card" content="summary_large_image">
<meta property="og:description" content="ColdFusion, ColdBox, CommandBox and other assorted musings">
<meta name="twitter:description" content="ColdFusion, ColdBox, CommandBox and other assorted musings">
<meta property="og:image" content="https://static.kisdigital.com/kisdigital-logo.jpg">
<meta name="twitter:image" content="https://static.kisdigital.com/kisdigital-logo.jpg">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/font/bootstrap-icons.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism-themes/1.9.0/prism-one-dark.min.css" integrity="sha512-c6S8OdtvoqZCbMfA1lWE0qd368pLdFvVHVILQzNizfowC+zV8rmVKdSlmL5SuidvATO0A7awDg53axd+s/9amw==" crossorigin="anonymous" referrerpolicy="no-referrer">
<link rel="stylesheet" href="/assets/css/site.css?v=1714586005784">
</head>
<body style="padding-top: 70px;">
<header>
<nav class="header">
<div class="container">
<div class="row">
<div class="col-2"> <a class="site-link" href="/"><span style="color: var(--post-link-text)">KIS</span>Digital</a>
</div>
<div class="col-10"></div>
</div>
</div>
</nav>
</header>
<div class="container text-white">
<div class="row">
<div class="col-lg-8 col-md-12">
<div class="card rf-card-bordered text-white">
<div class="card-body">
<div class="card-text">
<div class="row mb-5">
<div class="col-12 p-1"> <a href="/posts/2024/04/tidying-up-html-with-jsoup-part-deux" class="text-decoration-none">
In the end, I was able to get commandbox-ssg to output nicely formatted HTML using only ColdFusion, Node not required.
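For anyone who wants more control over the tidy pass, here is a minimal sketch of an extended service. It assumes the same models/lib layout described earlier. The tidy() function and its indent argument are hypothetical names and are not part of commandbox-ssg. Instead of leaning on trim() to turn the Document that parse() returns into a string, it calls jSoup's outerHtml() explicitly and uses the document's output settings to choose the indent width.

```cfml
// models/JSoup.cfc (sketch only: tidy() is a hypothetical helper)
component {

    // every jar found in the lib folder next to this component
    LIB_PATHS = directoryList(
        getDirectoryFromPath( getCurrentTemplatePath() ) & "lib",
        false,
        "path"
    );

    JSoup function init(){
        variables.jsoup = createObject( "java", "org.jsoup.Jsoup", LIB_PATHS );
        return this;
    }

    // parse the raw markup and return pretty-printed HTML as a string
    string function tidy( required string html, numeric indent = 2 ){
        // Jsoup.parse() returns an org.jsoup.nodes.Document
        var doc = variables.jsoup.parse( arguments.html );
        // OutputSettings controls how the document is serialized
        doc.outputSettings()
            .prettyPrint( true )
            .indentAmount( javacast( "int", arguments.indent ) );
        return trim( doc.outerHtml() );
    }

}
```

With something like this in place, the last line of renderTemplate() could in principle become return JSoup.tidy( renderedHtml, 4 ); if a wider indent is preferred.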