
Blog: #unicode

Grepping by Unicode range

TIL that ripgrep supports searching by Unicode ranges. For example:

rg "[\u0250-\u1FFF\u2027-\uFFFF]"

This greps for anything after the Latin Extended-B block, excluding some general punctuation like em/en dashes, curly quotes, and ellipses. (Useful to me for easily checking which characters I’m using on my website and thus what a new font would need to support.)
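For scripting the same check, the character class carries over to Python's re module more or less unchanged. This is only a sketch: the name `unusual_chars` is made up for illustration, and, like the ripgrep pattern, the class stops at U+FFFF, so anything outside the Basic Multilingual Plane is ignored.

```python
import re

# Same ranges as the ripgrep pattern: U+0250-U+1FFF and U+2027-U+FFFF.
PATTERN = re.compile("[\u0250-\u1FFF\u2027-\uFFFF]")


def unusual_chars(text):
    """Return the distinct matching characters in text, sorted by code point."""
    return sorted(set(PATTERN.findall(text)))
```

Running it over a page's text gives a short list of the characters a replacement font would have to cover, without the dashes and curly quotes that almost every font has anyway.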


Reply via email

Update on Press (the PDF compiler). I haven’t worked on it at all lately, but I wanted to document its current state for history’s sake, and as part of working in public. (I’ve also been sitting on this post for over a year.)

Back in 2017 I did end up re-architecting Press to use Low Ink as an intermediate page description language. In the process, Low Ink changed from a JSON-based idea to this:

:page 11x8.5in
:bleedbox x=0.125in y=0.125in w=5.75in h=8.75in
:fontmap family=helv weight=regular style=normal standard=Helvetica
:yinvert
:push
:translate x=72 y=72

# ascender
:push
:translate x=0 y=1040
:strokecolor hex=#999
:linewidth 0.25pt
:line x1=0 y1=0 x2=1080 y2=0
:stroke
:push
:fillcolor hex=#999
:font family=helv size=14pt
:text x=1085 y=-3 text="ascender"
:pop
:pop

# filled glyph
:push
:translate x=1320 y=240
:fillcolor hex=#000
:moveto x=0 y=0
:pathto x=400 y=300 cx1=120 cy1=300 cx2=140 cy2=300
:pathto x=320 y=200 cx1=540 cy1=300 cx2=320 cy2=180
:lineto x=350 y=350
:lineto x=450 y=250
:lineto x=150 y=0
:moveto x=200 y=200
:lineto x=200 y=250
:lineto x=250 y=250
:lineto x=250 y=200
:lineto x=200 y=200
:fill
:pop

It was intended to be a fairly low-level wrapper on the PDF format, with the idea being that other libraries/apps would provide more ergonomic abstractions on top of it.
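To give a feel for how little machinery this syntax needs, here is a rough sketch of a line parser in Python. It is not Press's actual parser: `parse_line` is a hypothetical helper, it assumes one command per line, and it leaves every value as a string (units such as 0.25pt or 11x8.5in are not interpreted).

```python
import shlex


def parse_line(line):
    """Parse one Low Ink line into (command, positional_args, named_params).

    Returns None for blank lines and '#' comment lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    if not line.startswith(":") or len(line) == 1:
        raise ValueError(f"expected a ':command', got {line!r}")
    # shlex keeps quoted values such as text="ascender" together and
    # strips the quotes; '#' is not treated as a comment inside a line.
    tokens = shlex.split(line[1:])
    command, positional, named = tokens[0], [], {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep:
            named[key] = value
        else:
            positional.append(token)
    return command, positional, named
```

With this, `:text x=1085 y=-3 text="ascender"` comes back as the command `text` with three named parameters, and `:page 11x8.5in` as the command `page` with one positional argument.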

I initially used Python because Press started out as a library, but with the pivot to a compiler model, I think Go or Rust would probably end up being a better choice (Rust would make integrating HarfBuzz a bit easier, at any rate).

Potential improvements

To my 2021 eyes, the language design isn’t particularly elegant. I like that the parameters are named (for clarity), but most commands don’t actually have many parameters, because many of the settings that would normally be parameters are separate commands. Where a parameter’s meaning is already unambiguous, the names hamper readability. For example, I think something like this might be better:

:line 0,0 to 1080,0
:fillcolor #345

I’ve also thought that push and pop could potentially be clearer as curly braces, and that the initial colons aren’t really necessary:

{
  translate 0,1040
  strokecolor #999
  linewidth 0.25pt
  line 0,0 to 1080,0
  stroke

  {
    fillcolor #999
    font 14pt helv
    text 1085,-3 "ascender"
  }
}

The future

My initial reason for building Press was to have an easy, programmable cross-platform way to create language chart PDFs (so I could move away from PlotDevice/DrawBot), and what I’ve realized (acknowledging that I haven’t really been making language charts in recent years) is that there are some other, better options now.

One that seems decent is SVG, converted to PDF by way of Inkscape. My initial tests suggest it would probably work fine.

Another promising option that I admittedly haven’t looked into very much yet is Paged.js. HTML and CSS are already great for declarative typesetting, and the more I’ve thought about programmatic typesetting, the more this model seems to be the future I’d want to work with (and not just because of parity with web, though that makes it much more compelling).

tl;dr I don’t see myself continuing on with Press, so we may as well call a mortem on it.


Reply via email

A spectre is haunting Unicode:

In 1978 Japan’s Ministry of Economy, Trade and Industry established the encoding that would later be known as JIS X 0208, which still serves as an important reference for all Japanese encodings. However, after the JIS standard was released people noticed something strange — several of the added characters had no obvious sources, and nobody could tell what they meant or how they should be pronounced. Nobody was sure where they came from. These are what came to be known as the ghost characters (幽霊文字).


Reply via email

Making PDFs by hand

I’ve been hand-coding PDFs in Vim, reading the PDF spec to learn how things work. It’s fascinating. My first, extremely simple PDF:

%PDF-1.4
1 0 obj << /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj << /Type /Page /Parent 2 0 R /Resources 4 0 R /MediaBox [0 0 500 800] /Contents 6 0 R >>
endobj
4 0 obj << /Font << /F1 5 0 R >> >>
endobj
5 0 obj << /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>
endobj
6 0 obj
<< /Length 44 >>
stream
BT /F1 24 Tf 175 720 Td (Hello World!) Tj ET
endstream
endobj
xref
0 7
0000000000 65535 f
0000000010 00000 n
0000000059 00000 n
0000000116 00000 n
0000000220 00000 n
0000000263 00000 n
0000000333 00000 n
trailer << /Size 7 /Root 1 0 R >>
startxref
427
%%EOF

It’s not as bad as it looks, I promise. (I’m doing PDF 1.4 because CreateSpace doesn’t seem to support higher versions of the spec.)
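The fiddly part of doing this by hand is the xref table: each entry is the byte offset of an object, and the number after startxref is the byte offset of the xref keyword itself, so any edit higher up in the file shifts the numbers. As a rough sketch (the `build_pdf` helper is hypothetical, not part of any library), the same six objects could be assembled in Python with the offsets computed automatically:

```python
def build_pdf(objects):
    """Join object bodies into a PDF 1.4 file, computing the xref offsets.

    objects is a list of bytes; item i becomes object number i + 1.
    """
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" begins
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)    # startxref points at the "xref" keyword
    size = len(objects) + 1  # plus the special free entry for object 0
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        # Every xref entry is exactly 20 bytes, including the line ending.
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % size
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


content = b"BT /F1 24 Tf 175 720 Td (Hello World!) Tj ET"
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /Resources 4 0 R"
    b" /MediaBox [0 0 500 800] /Contents 6 0 R >>",
    b"<< /Font << /F1 5 0 R >> >>",
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
]
pdf = build_pdf(objects)
```

Writing `pdf` to a file opened in binary mode should give a document equivalent to the hand-written one. The /Length value is computed from the content stream too, so it stays correct when the text changes.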

Anyway, I’ve been reading through chapter 5 of the spec, learning how text works in PDF. I’ve learned how to modify character spacing with Tc, word spacing with Tw, leading with TL, and individual glyph positions with TJ (not sure yet if I can change vertical positioning or not). I’ve also learned how to change the text color. It’s all been fairly straightforward.
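To make those operators concrete, here is a small sketch that writes out a text object using them. The helper names (`escape`, `text_block`, `kerned`) are made up for illustration, and text color is set with rg on the assumption that the text is drawn in the default fill rendering mode.

```python
def escape(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_block(lines, x, y, size, leading, char_space=0, word_space=0,
               rgb=(0, 0, 0), font="F1"):
    """Build the operators for a multi-line text object."""
    ops = [
        "BT",
        f"/{font} {size} Tf",        # font resource name and size
        "{} {} {} rg".format(*rgb),  # fill color, used for filled text
        f"{char_space} Tc",          # extra space after every glyph
        f"{word_space} Tw",          # extra space after each space character
        f"{leading} TL",             # line distance used by T*
        f"{x} {y} Td",               # position of the first baseline
    ]
    for index, line in enumerate(lines):
        if index:
            ops.append("T*")         # move down one line by the leading
        ops.append(f"({escape(line)}) Tj")
    ops.append("ET")
    return "\n".join(ops)


def kerned(parts):
    """Build a TJ array from strings and numeric adjustments.

    Numbers are in thousandths of a text-space unit and are subtracted from
    the current position, so a positive number pulls the next glyph closer.
    """
    items = [f"({escape(p)})" if isinstance(p, str) else str(p) for p in parts]
    return "[" + " ".join(items) + "] TJ"
```

For instance, `kerned(["A", 120, "W"])` produces `[(A) 120 (W)] TJ`, which tightens the gap between the two letters by 0.12 of the font size.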

As part of this, I’ve used Hex Fiend (an OS X hex editor) to pry apart some simple PDFs I made with PlotDevice, to see how things were encoded. The streams themselves are generally compressed through Flate compression (opposite of deflate, har har), and I found this script to easily decode the streams:

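#!/usr/bin/env python3

import sys
import zlib

# First argument: file holding the raw compressed stream bytes.
# Second argument: file to write the decoded stream to.
in_path = sys.argv[1]
out_path = sys.argv[2]

with open(in_path, 'rb') as f:
    buffer = f.read()

decomp = zlib.decompress(buffer)

# zlib returns bytes, so the output file has to be opened in binary mode.
with open(out_path, 'wb') as f:
    f.write(decomp)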

I copied each stream in hex from Hex Fiend, pasted it into a file, ran the Python script on it, and it would output decoded text to a new file.
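The copy-and-paste step could probably be skipped by scanning the PDF file directly. The following is a sketch under some loud assumptions: `inflate_streams` is a hypothetical helper, it finds streams with a regular expression instead of parsing each object's /Length, and it silently skips any stream that isn't plain zlib data (images, other filters, encrypted files).

```python
import re
import zlib

# Everything between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decoded contents of every Flate-compressed stream found."""
    decoded = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # A decompress object stops at the end of the zlib data, so the
            # line ending before "endstream" does not cause an error.
            decoded.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            continue  # not zlib data, so leave it alone
    return decoded
```

Something like `inflate_streams(open(path, "rb").read())` would then return the page content streams as bytes, ready to print or write out.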

Things I don’t know/understand yet, which are legion:

  • How to encode Unicode (I haven’t reached this part of the spec yet, but I believe it involves CID fonts and using CMaps to map glyph codes or something like that).
  • How to take a font name and, in a cross-platform way, get the path to the font file so I can embed it and also use it with HarfBuzz.
  • How to take the output of HarfBuzz (a list of glyphs with position coordinates for each) and use that in positioning the glyphs in the PDF. I believe HarfBuzz will handle parsing the OpenType features of the font, but I’m not positive on that. I did get HarfBuzz Python bindings working, though, and I plan to play around with it soon.
  • Whether I need to use FreeType at all. I might need it for font metrics, but HarfBuzz might give me everything I need there.
  • When typesetting multiple lines, I don’t know whether it’s best to use the PDF built-in support (T* and TL and such), or to set each line manually as its own text object. The built-in support seems better, though I don’t know if that limits what’s possible.

At some point soon — I think when I start embedding fonts — doing this by hand in Vim will stop being as feasible, and at that point I’ll start writing Python to manage the PDF creation process for me. For now, though, it’s easier to just edit the PDF manually.


Reply via email

Unicode Inspector

I’ve lately had the need to find what the code points are for some Unicode text, so I wrote a little web app:

Basically, you type in text and it tells you what the Unicode hex codes are. Pretty simple. There’s a live version on GitHub.

Nerdy notes

  • I’m using punycode.js to do the conversion.
  • I haven’t yet tested it with anything above U+FFFF.
  • Firefox shows the dotted circle for combining marks, but Chrome sadly doesn’t. (Which is why I used Firefox for the screenshot.)
  • At some point I’d like to add more information about the characters — Unicode name, classification, link to chart, etc.
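The core of the idea, along with the name and classification mentioned in the last note, fits in a few lines of Python. This is a sketch for comparison, not the app's code (which is JavaScript); `inspect` is a made-up name, and the names and categories come from whatever Unicode version ships with the Python standard library.

```python
import unicodedata


def inspect(text):
    """Return (code point, general category, name) for each character."""
    rows = []
    for char in text:
        rows.append((
            f"U+{ord(char):04X}",                 # at least four hex digits
            unicodedata.category(char),           # e.g. "Ll", "Mn"
            unicodedata.name(char, "<unnamed>"),  # fallback for unnamed code points
        ))
    return rows
```

Because Python strings are sequences of code points, characters above U+FFFF come through as a single row, with no surrogate pairs to reassemble.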

Reply via email

JavaScript entity conversion

For future reference: if you’re using JavaScript and want to convert a decimal entity in HTML (&#272;, for example) to the Unicode character it represents (“Đ”), this works:

// converts "fianc&#233;" to "fiancé", etc.

const newstr = oldstr.replace(/&#([0-9]+);/g,
            function(full, charcode) {
                // fromCodePoint also handles code points above U+FFFF
                return String.fromCodePoint(parseInt(charcode, 10));
            });

The full parameter is ignored; we want the second one, charcode, which is the first capture group in the regex (the character code).
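For comparison, the same replacement is about as short in Python. This is a sketch with a made-up function name; it handles decimal entities only and raises ValueError for numbers beyond U+10FFFF.

```python
import re


def decode_decimal_entities(text):
    """Replace every decimal entity such as &#233; with its character."""
    # group(1) is the run of digits between "&#" and ";"
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)
```

The standard library's html.unescape covers named and hexadecimal entities as well, so it is likely the better choice when the input is real HTML.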


Reply via email