Bulk-editing HTML files with images containing a # sign

A friend of mine recently pinged me with a problem he’s having. I wanted to share the problem and its solution, both because I hope it will help him and maybe others, and because my initial attempts at the problem were way more complicated than they needed to be.

The problem

My friend used some software to generate a whole bunch of HTML reports, and the reports contained images with names like “Project #1 – Blah/Image #1.jpg” or such. None of the images showed up.

He was savvy enough to recognize that the # (octothorp, pound sign, number sign, “hashtag symbol,” etc.) was the problem, but this was little comfort: with about 100 generated HTML files, each with six images, editing it all was perhaps an entire day’s work.

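The problem is that the pound sign is a special character in URLs: it marks the start of the fragment identifier, so the browser treats everything after it as a pointer to a spot within the page rather than as part of the file name. (I admit, I had to look up what it’s actually called.)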
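To see what that means for a path like the ones in the reports, here is a small Python sketch. It uses the standard library's URL parser as a rough stand-in for what a browser does, and the image path is made up to resemble the ones my friend described.

```python
from urllib.parse import urlsplit

# Hypothetical image path, modeled on the names in the generated reports.
src = "Project #1/Image #1.jpg"

parts = urlsplit(src)

# The path is the part that would actually be requested.
print(repr(parts.path))

# Everything after the first '#' is treated as the fragment, not the file name.
print(repr(parts.fragment))
```

The path stops at the first #, so the request never names the real image file, which is why nothing shows up.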

The over-engineered fix

What I struggled with was this: I could easily write a quick shell script to rename all the folders, and then mass-edit the HTML files. It would probably be 5-10 minutes of work.

What I couldn’t do easily, though, was write a Ruby (or whatnot) script without seeing the files and then get my friend to run it, on what I presume is a Windows computer without a Ruby interpreter. It’s one of those problems that I could very easily fix for myself, but communicating it or getting it to run for someone else is tremendously more complicated.

So I proposed that he just create a Zip file with the entire folder contents, including 600 images, and send it to me to fix.

This is a bad fix for two reasons. One, emailing 600 images is crazy. But two, there’s a much easier fix that hadn’t occurred to me.

Tech-savvy, non-programmer fix

Having suggested the above fix at first, I mulled over something else he said: this was one of those things he wasn’t sure how to even search for on Google, because he didn’t really know the terms. I think this is a problem we’ve all faced at one time or another. (For example, when I dropped a tiny little weird-shaped screw into my carpet and wanted to buy a replacement. If you have no idea what the thing is called, and I didn’t, finding it online will be even harder than finding it in your carpet.)

I wondered: what could you search for? I tried “how to bulk-edit HTML files” or something of the sort. It likely wouldn’t have gotten him the details he needed. But it helped me realize an easier fix!

There’s an easy way to bulk-edit text files, beyond writing a script to do it. And it’s something that programmers in particular should be familiar with: text editors.

And then something else occurred to me: you don’t actually need to rename the files! You just need to percent-encode the # in the URL!
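If you want to check what a percent-encoded path should look like, Python's standard library can do the encoding for you. This is just an illustration with a made-up path, not a required step.

```python
from urllib.parse import quote

# Hypothetical image path containing both a space and a '#'.
src = "Photos #1/pic.png"

# Minimal fix: encode only the '#'.
minimal = src.replace("#", "%23")

# Fuller fix: quote() percent-encodes characters that are not safe in a URL path.
# The '/' separator is left alone by default.
full = quote(src)

print(minimal)
print(full)
```

The minimal version only touches the #, while quote() also turns the space into %20.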

So this suddenly becomes a much easier solution to relay.

  • Back up your work!
  • Download a free and reputable text editor. Notepad++ is highly regarded on Windows, but I’m not a Windows user. I use Sublime Text on my Mac, and I still have a TextMate install as well.
  • Open the entire folder in the text editor.
  • Use the editor’s global search-and-replace / find-and-replace functionality. In Notepad++, it looks like it’s called Find in Files.
  • Make sure you really do have the project backed up; global search and replace on a project can be a real pain to undo if it goes wrong.
  • Search for # in all HTML files (*.html) and replace it with the percent-encoded representation, %23. This can look ugly: “Photos #1/pic.png” becomes “Photos %231/pic.png”. (The space should technically be %20, because a space is also a special character, but browsers are better about figuring that out.) Make sure “Replace all” or the equivalent is selected.
  • Ideally, test it in one file, save that file, and verify that the images now show up.
  • Let it roar through all of the files for you.
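For anyone who does happen to have Python installed, the steps above can be roughly sketched as a short script. Treat it as a sketch under assumptions: the function name `encode_hashes` is my own invention, it only looks at files ending in .html, and it replaces every # it finds, exactly like an editor's Replace All would. It works on raw bytes so that it doesn't have to guess each file's text encoding or line endings.

```python
import shutil
from pathlib import Path


def encode_hashes(folder: str) -> int:
    """Replace every '#' with '%23' in the .html files under `folder`.

    Before a file is modified, a copy is saved next to it with '.bak'
    appended to its name. Returns the number of files that were changed.
    """
    changed = 0
    for html_file in Path(folder).rglob("*.html"):
        data = html_file.read_bytes()
        fixed = data.replace(b"#", b"%23")
        if fixed != data:
            # Keep a backup so the change is easy to undo.
            backup = html_file.with_name(html_file.name + ".bak")
            shutil.copy2(html_file, backup)
            html_file.write_bytes(fixed)
            changed += 1
    return changed
```

You would call it with the folder that holds the reports, something like `encode_hashes("reports")`, where "reports" is a placeholder for the real folder name. Try it on a copy of one report first.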

There’s one other risk worth noting, which is that the # sign could be used elsewhere in the file, particularly in its correct use as a fragment identifier, like if there were links to different sections within the page. In that case, you would need to be slightly more selective in your search and replace, and this is where it gets gory. You’d need to find a search string that matches only the image URLs, maybe something like “Photos #”, and then replace it with the same thing, but with the “#” changed to a “%23”. In other words, search for “Photos #” and replace it with “Photos %23”.
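If you do end up scripting it, one way to be selective is to touch only the # characters that sit inside src attributes. The sketch below makes some assumptions: the attribute values are wrapped in quotes, and a regular expression is good enough for machine-generated HTML like these reports. It is not a real HTML parser, and the function name `encode_src_hashes` is hypothetical.

```python
import re

# Matches src="..." or src='...' and captures the quoted value.
SRC_ATTR = re.compile(r"""(\bsrc\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE | re.DOTALL)


def encode_src_hashes(html: str) -> str:
    """Percent-encode '#' inside src attribute values and nowhere else."""

    def fix(match: re.Match) -> str:
        prefix, quote_char, value = match.groups()
        return f"{prefix}{quote_char}{value.replace('#', '%23')}{quote_char}"

    return SRC_ATTR.sub(fix, html)


# Made-up snippet: one in-page link (which should stay as is) and one image.
sample = '<a href="#top">Top</a><img src="Photos #1/pic.png">'
print(encode_src_hashes(sample))
```

The in-page link keeps its # because it lives in an href attribute, and only the image path is rewritten.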