Updated 2004-06-28 10:21:28

George Peter Staplin: I've had a design floating in my brain for over a year for a text-like widget that is simple and robust. Parts of the design are incomplete and I would appreciate help.

Here are some ideas relating to that:

Each line will consist of a structure like this:
 typedef struct TextFieldLine_s {
struct TextFieldLine_s *next;
struct TextFieldLine_s *previous;
unsigned int width;
unsigned int height;
unsigned char *b;
size_t b_size;
} TextFieldLine;

Note: The pointers next and previous are used to implement our double-linked list. This should reduce the cost of inserting and deleting lines or contents of lines.

Now as for the data stored in *b we will define some markers so we would have something like:

MARKER0x01MARKERText

In the current design 0x00/NUL would be the MARKER.

In the case above 0x01 is the id of our text context. 0x01 could be looked up as an offset into an array that stores a structure with the GC and various other information needed to draw the text or measure it. I like to call this structure a Text Context or TC.

A TC would be something like this:
 typedef struct TextFieldLineList_s {
TextFieldLine *line;
struct TextFieldLineList_s *next;
} TextFieldLineList;

typedef struct {
int width; /*only used for images*/
int height;
GC gc;
TextFieldLineList list; /*our list of lines using this TC*/
Pixmap pix; /*for images*/
} TC;

For proper scrollbar operation we will need to measure each line for its width and height. We also need to take into account the entire width and height, so that the ratio is proper. As lines wrap we will need to calculate the amount of height that is added from the wrap. This presents a problem, because we can't treat each line as the same height and have it always appear properly in the rendered display. To further complicate the matter some TCs will only be images.

The text will be rendered and measured using a state machine. There would be two basic states for measuring. If the MARKER char is received then we enter a state of TC and use the next character as an offset into our lookup table for the TC. Once we have that structure from the offset we set it as our current TC (variable cur_tc). We then proceed until the end MARKER and we can then begin to process our Text. Once we have reached cur_line->b_size for our cur_line->b array or another MARKER we would invoke a measure of the text via a Tk function, such as Tk_TextWidth(). The height will not be dynamic like the width, so we can rely on the cur_tc->height for our final calculations of the total height of the line (note wrapping introduces some extra complexity for this to work properly).

Caveat: a byte/pow(2,8)-1 may not be large enough for complex text that has many tags/TCs. It might be better to say that the TC offset is a short stored in two bytes and terminated by a MARKER. That brings to mind the possibility that a MARKER to terminate a TC offset may not be needed. If we use a standard size for all TC offsets then it shouldn't be needed, and we would save quite a few bytes over the run of a massive amount of text.

What am I missing? It seems like proper wrapping would involve a more complex structure to handle height differences properly. For example if we have one chunk of text marked with a TC that has a height of 12 and another on the same line marked with a TC that has a height of 20 we may have wrapping strangeness. How would you solve this?

pwq 28 Jun 04, You need to think of boxes within boxes. I think you would be better off constructing a tree structure based on TC, such that you have text leafs for before TC and after TC. Having the codes and tags embedded in the text will be complicated to process.

pwq 10 Jun 04, You have not stated the requirements of the widget, i.e., what it can and can't do, so it is hard to provide meaningful help.

I suggest that you consult the sources to TeX. This gives the basics on how you would layout paragraphs as well as include graphics.

The basic engine gives a framework for doing everything to do with text layout; everything else is built out of macros that are of no importance for your widget. See the TeX archive [1].

George Peter Staplin: The requirements are:

1. proper scrollbar behavior without a background calculation task (like the current 8.5 widget)
2. a simpler design than Tk's current text
3. tag-like attributes for text (colors, fonts)

escargo - I agree that looking at TeX is a place to start. You are only trying to determine the bounding box for each line, and where to put line breaks. As I recall, algorithms to do this have been called dynamic programming problems, which might be enough of a hint to find existing solutions to the problem.

LV GPS, when you say simpler design - do you mean less features or do you mean internally, implement the same features in a more straight-forward manner?

George Peter Staplin: A more readable/factored design. It will probably have fewer features, but I will expose more structure to Tcl in order to allow the programmer more flexibility. Exposing more structure to Tcl and actually storing the widget structures has worked quite well with my Smalltick widgets.

Lars H: I agree that looking at TeX should be the way to start, but you'll probably find The TeXbook (ISBN 0-201-13448-9; it might be available in a library Near You) more accessible as a description of how it works than the actual TeX sources. It can all be found in ftp://www.ctan.org/pub/tex-archive/systems/knuth/ though. In case the process of typesetting Web sources is unfamiliar to you, you can find a PDF of tex.web at [2] (6.0MB).

In short, the model used in TeX is that every line of text is a "box". Boxes have width, height, depth, and contents, so this is pretty close to your TextFieldLine_s above. The reason to keep track of height (extent above the baseline) and depth (extent below the baseline) separately is that the typographical goal is to have all baselines at equal distance, unless something extends very high or low in a line. In order to know whether this will happen, you need to check that the depth of one line plus the height of the following line does not exceed the relevant limit for the baseline separation you've chosen.

The contents of a line box (horizontal box, or \hbox for the TeXie) in TeX is a list of "horizontal material", most of which is typically ordinary characters (from some font) and spaces. I suspect it would be most efficient in practice to store the contents as a string (at least initially; if that isn't good enough then you can always do a more native reimplementation later) where some character has been designated as a "mode shift" character -- not unlike your MARKER above. In the basic mode, characters are simply characters in the current font and spaces are the current interword space. In the shifted mode, the string should instead be a list of more special items to put into the line, e.g. images. Writing <SHIFT> for the shift character, the contents of a box could look like this:
  line, e.g. images, such as this: <SHIFT>
{image/jpeg {w 60 h 10 d 2} data/images/caterpillar.jpeg}
{glue {w 10 wplus 2}}<SHIFT>This extraordinary little creature

The idea here is that all special items are lists where the first element specify their type and the second element is a dictionary which specifies their metrics (at least those metrics that are nonzero). This way, the automaton (or whatever) that parses the contents to determine the metrics of a line as a whole doesn't have to know every special kind of item that can appear, it is sufficient if it can measure ordinary text and add things up.

As for wrapping, that is a slighly separate process in TeX. First all the text from an entire paragraph is collected on "the main horizontal list" (which is just like the list of contents for an \hbox) and then TeX seeks the best way to break it into lines of a specified length (the \hsize). You would probably not want to do it like TeX in an editable text field, because it could make words jump back and forth between lines as you type.

pwq 28 Jun 04, NEM: A good summary. I have looked at TeX internals a number of times and the layout model becomes complicated when you want to do something like flow text around a figure within a table in a column. As can be seen by the complex macros required to do this (and they fail for some simple cases).

I am currently coding up a constraint based layout manager using LP and FD constraints. This seems to give the most flexability for arbitary layouts. It also means that for GPS's example , the TC would manage themselves and just have to respond to events during typing to re constrain the layout.

It has been done before on a number of packages, most recent is SCED.

Category Widget