Rewrite HTML attributes after parsing
To rewrite anchors, inject attributes, normalize URLs, or strip sentinels in already-rendered HTML, implement IHtmlResponseRewriter. All rewriters share a single AngleSharp parse and operate on the same IDocument. For non-HTML response types (JSON, plain text) or work that needs the final byte stream, see the how-to "Transform the response body on every page" instead.
The recipe references examples/ExtensibilityLabExample/AnchorLowercaseRewriter.cs, which exercises both phases of the contract against a bare AddPennington host.
Before you begin
- An existing Pennington site rendering HTML pages (see Create your first Pennington site if not).
- A clear sense of which phase fits the edit: a non-HTML token (something not valid HTML structure, like <xref:uid> or a sentinel comment) belongs in PreParseAsync; anything queryable by selectors belongs in ApplyAsync.
Write the rewriter
Implement Pennington.Infrastructure.IHtmlResponseRewriter as a sealed class. Three rules carry the page:
- ShouldApply runs per-response; return false to skip both phases when the content-type, path, or headers mean there is nothing to do. The example narrows to text/html responses so non-HTML endpoints (search index JSON, llms.txt) bypass the rewriter entirely.
- PreParseAsync receives the raw HTML string and returns the string to parse. Use it only when the target construct is not valid HTML structure; raw <xref:uid> tags are the canonical shipped example. Return the input unchanged when there is nothing to do.
- ApplyAsync receives the already-parsed IDocument shared by every rewriter: query with QuerySelectorAll, mutate attributes and text, and return. Do not re-serialize or reparse.
namespace ExtensibilityLabExample;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using Microsoft.AspNetCore.Http;
using Pennington.Infrastructure;
/// <summary>
/// Implements <see cref="IHtmlResponseRewriter"/> and demonstrates both
/// halves of the contract:
/// <list type="bullet">
/// <item><description><see cref="PreParseAsync"/> runs a cheap string
/// replace over the raw HTML before AngleSharp parses it. We use it to
/// strip the <c>&lt;!--LOWERCASE-SENTINEL--&gt;</c> comment, the kind
/// of pre-parse cleanup a real rewriter does for non-HTML tokens like
/// <c>&lt;xref:uid&gt;</c>.</description></item>
/// <item><description><see cref="ApplyAsync"/> walks the parsed document
/// and lowercases the text content of every <c>&lt;a&gt;</c> tag
/// marked <c>data-lowercase</c>.</description></item>
/// </list>
/// <para>
/// <see cref="Order"/> is 500: after the shipped xref (10), locale (20),
/// and base-URL (30) rewriters, so our pass sees already-resolved hrefs.
/// </para>
/// <para>
/// Backs how-to 2.3.50 <c>/how-to/extensibility/html-rewriter</c>.
/// </para>
/// </summary>
public sealed class AnchorLowercaseRewriter : IHtmlResponseRewriter
{
    public int Order => 500;

    public bool ShouldApply(HttpContext context)
    {
        var contentType = context.Response.ContentType;
        return contentType is not null
            && contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase);
    }

    /// <summary>
    /// Pre-parse pass. Strip the sentinel comment so it is gone before
    /// AngleSharp runs. A string replace is the right tool when the
    /// target construct is not valid HTML structure (raw <c>&lt;xref&gt;</c>
    /// tags are the canonical example shipped with Pennington).
    /// </summary>
    public Task<string> PreParseAsync(string html, HttpContext context)
    {
        if (!html.Contains("<!--LOWERCASE-SENTINEL-->", StringComparison.Ordinal))
        {
            return Task.FromResult(html);
        }

        return Task.FromResult(html.Replace("<!--LOWERCASE-SENTINEL-->", string.Empty, StringComparison.Ordinal));
    }

    /// <summary>
    /// DOM pass. Walk the parsed document, find every <c>&lt;a&gt;</c>
    /// with <c>data-lowercase</c>, lowercase its text content.
    /// </summary>
    public Task ApplyAsync(IDocument document, HttpContext context)
    {
        foreach (var element in document.QuerySelectorAll("a[data-lowercase]"))
        {
            if (element is not IHtmlAnchorElement anchor)
            {
                continue;
            }

            if (string.IsNullOrEmpty(anchor.TextContent))
            {
                continue;
            }

            anchor.TextContent = anchor.TextContent.ToLowerInvariant();
        }

        return Task.CompletedTask;
    }
}
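The same contract also covers attribute injection. The following is a minimal sketch of a DOM-only rewriter that leaves PreParseAsync as a pass-through. The class name BlankTargetRelRewriter, the Order value 510, and the rel="noopener" policy are all hypothetical, and the sketch assumes the interface consists of exactly the four members implemented in the example above.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp.Dom;
using Microsoft.AspNetCore.Http;
using Pennington.Infrastructure;

namespace ExtensibilityLabExample;

// Hypothetical rewriter for illustration; it is not part of the example project.
public sealed class BlankTargetRelRewriter : IHtmlResponseRewriter
{
    // Assumed value: anything above the shipped rewriters would do.
    public int Order => 510;

    public bool ShouldApply(HttpContext context)
    {
        var contentType = context.Response.ContentType;
        return contentType is not null
            && contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase);
    }

    // Nothing to fix at the string level, so return the input unchanged.
    public Task<string> PreParseAsync(string html, HttpContext context)
        => Task.FromResult(html);

    // Add rel="noopener" to anchors that open a new tab.
    // Anchors that already carry a rel attribute are left untouched.
    public Task ApplyAsync(IDocument document, HttpContext context)
    {
        foreach (var anchor in document.QuerySelectorAll("a[target=\"_blank\"]"))
        {
            if (!anchor.HasAttribute("rel"))
            {
                anchor.SetAttribute("rel", "noopener");
            }
        }

        return Task.CompletedTask;
    }
}
```

Because the target is ordinary HTML structure that a selector can reach, all of the work sits in ApplyAsync.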
Pick an Order value
The shipped rewriters occupy Order values from 10 (xref resolution) through 60 (the last built-in transform); xref resolution, locale prefixing, and base-URL prefixing run in that relative order because each produces the link form the next one consumes. Pick above 60 to run after every shipped transform, below 10 to run before xref resolution, or between the built-ins only when that placement is deliberate. For the exact Order of each shipped rewriter, see Pennington.Infrastructure.IHtmlResponseRewriter. The example uses 500 so anchors are lowercased after every shipped transform has run.
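To make the ordering concrete, here is a small sketch that sorts a set of rewriters by Order, lowest first. It uses plain tuples rather than Pennington types. The values 10, 20, 30, and 500 come from the example's comments; the labels and the "custom-early" entry at 5 are illustrative assumptions. How Pennington breaks ties between equal Order values is not stated here, so the sketch avoids duplicates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrderSketch
{
    // Returns the names in the sequence they would run: ascending Order.
    public static IReadOnlyList<string> RunSequence(
        IEnumerable<(string Name, int Order)> rewriters)
        => rewriters.OrderBy(r => r.Order).Select(r => r.Name).ToList();

    // Registration order is deliberately scrambled to show that only
    // the Order value decides the sequence.
    public static IReadOnlyList<string> Example()
        => RunSequence(new[]
        {
            ("AnchorLowercaseRewriter", 500),
            ("base-URL", 30),
            ("xref", 10),
            ("custom-early", 5), // hypothetical rewriter placed before xref resolution
            ("locale", 20),
        });
}
```

Example() yields custom-early, xref, locale, base-URL, AnchorLowercaseRewriter.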
Register the rewriter
Every registered IHtmlResponseRewriter is picked up and ordered by its Order value, so a single registration next to the host wiring is sufficient. Use the lifetime that matches your dependencies — AddSingleton for stateless rewriters, AddTransient (or AddFileWatched) when the rewriter captures file-watched state.
builder.Services.AddSingleton<IHtmlResponseRewriter, AnchorLowercaseRewriter>();
Configure the shipped word-break rewriter
One shipped rewriter you configure rather than implement is the word-break rewriter. AddWordBreak turns it on; it inserts <wbr> break opportunities into long identifiers so dotted namespaces and PascalCase names wrap inside narrow columns instead of overflowing.
builder.Services.AddWordBreak(options =>
{
    options.CssSelector = "h1, h2, h3, h4, h5, h6, span, .text-break";
    options.MinimumCharacters = 20;
});
A heading like Pennington.Infrastructure.WordBreakOptions then renders with breaks after each dot and before each interior case boundary:
Before:
<h3>Pennington.Infrastructure.WordBreakOptions</h3>
After:
<h3>Pennington.<wbr>Infrastructure.<wbr>WordBreakOptions</h3>
For every option and its default, see Pennington.Infrastructure.WordBreakOptions.
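As a rough illustration of the transformation, the sketch below reproduces only the after-dot breaks visible in the sample output. It is not Pennington's implementation: the class and method names are hypothetical, and it assumes MinimumCharacters acts as a length threshold below which text is left alone, which this page does not spell out.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBreakSketch
{
    // Matches a dot that is followed by a word character, so a trailing
    // sentence period does not receive a break opportunity.
    private static readonly Regex DotBeforeWord =
        new Regex(@"\.(?=\w)", RegexOptions.Compiled);

    // Takes plain text and returns markup with <wbr> after each interior dot.
    // The input is not HTML-encoded here; a real rewriter would work on
    // text nodes of the parsed document instead of raw strings.
    public static string InsertBreaks(string text, int minimumCharacters)
    {
        if (text.Length < minimumCharacters)
        {
            return text; // assumed meaning of the threshold: short text is skipped
        }

        return DotBeforeWord.Replace(text, ".<wbr>");
    }
}
```

With a threshold of 20, InsertBreaks("Pennington.Infrastructure.WordBreakOptions", 20) returns Pennington.<wbr>Infrastructure.<wbr>WordBreakOptions, while a short string such as System.IO comes back unchanged.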
Result
Anchors marked data-lowercase have their text content lowercased, and the sentinel comment is gone from view-source.
Before:
<!--LOWERCASE-SENTINEL-->
<a data-lowercase href="/docs/">Read the DOCS</a>
<a data-lowercase href="/blog/">Latest POSTS</a>
After:
<a data-lowercase href="/docs/">read the docs</a>
<a data-lowercase href="/blog/">latest posts</a>
Anchors without data-lowercase and non-HTML responses pass through unchanged.
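The two phases can also be tried outside a host. The sketch below depends only on AngleSharp and mirrors the flow described above: a string-level replace, a single parse, a DOM mutation, then serialization. TwoPhaseSketch and RewriteAsync are hypothetical names, and the way Pennington itself parses and serializes the document is not shown on this page, so treat this as an approximation rather than the real pipeline.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;

public static class TwoPhaseSketch
{
    private const string Sentinel = "<!--LOWERCASE-SENTINEL-->";

    public static async Task<string> RewriteAsync(string html)
    {
        // Phase 1 (pre-parse): remove the sentinel with a plain string replace.
        var cleaned = html.Replace(Sentinel, string.Empty, StringComparison.Ordinal);

        // Parse once; every DOM edit below works on this one document.
        var browsingContext = BrowsingContext.New(Configuration.Default);
        using var document = await browsingContext.OpenAsync(req => req.Content(cleaned));

        // Phase 2 (DOM): lowercase the text of anchors marked data-lowercase.
        foreach (var anchor in document.QuerySelectorAll("a[data-lowercase]"))
        {
            anchor.TextContent = anchor.TextContent.ToLowerInvariant();
        }

        // Serialize the body contents only, since the input is a fragment.
        return document.Body?.InnerHtml ?? string.Empty;
    }
}
```

Passing the "Before" markup through RewriteAsync returns anchors whose text is lowercased and no sentinel comment. The serializer may normalize attribute syntax, so compare on anchor text rather than on the exact string.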
Verify
- Run dotnet run --project examples/ExtensibilityLabExample and visit /lowercase-demo/. The text of every <a data-lowercase> anchor is lowercase in the rendered HTML, and <!--LOWERCASE-SENTINEL--> is absent from view-source.
- Static build: run dotnet run --project examples/ExtensibilityLabExample -- build output, then grep output/lowercase-demo/index.html to confirm the rewriter also runs during publish.
Related
- Reference: Response processing interfaces
- Reference: WordBreakOptions, the shipped word-break rewriter's configuration
- Background: The response-processing pipeline
- Related how-to: Write a response processor