Updated 2011-11-18 22:46:54 by AMG

How do you delete one line from a file?

What kind of file are you talking about? There aren't, in general, lines in MP3 files, or GIF files, etc. In fact, most operating systems today don't have operating system calls that deal in lines. Instead, they come with what I have named, at times, "meta-read" functions - pieces of code which read in chunks of bytes from a file, and then, depending on the function's intention, add some code implementing a common interpretation of some of those bytes.

Most people asking this particular question are really asking

"Assume that I have a plain, traditional text file containing 7-bit ASCII characters, where a newline character indicates the end of a "line". How can I delete one of those "lines"?"

And honestly, I really believe that's what is being asked. So the following attempts to address that specific interpretation of this question.

Here's an example:
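    # $filename and $line_number_to_remove must already be set.
    # The line number is a zero-based list index (0 is the first line).
    set tmpname /tmp/something

    # Read the whole file and split it into a list of lines.
    set source [open $filename]
    set content [read $source]
    close $source
    set lines [split $content \n]

    # Drop the unwanted element, write the rest to a temporary file,
    # then move the temporary file over the original.
    set lines_after_deletion \
       [lreplace $lines $line_number_to_remove $line_number_to_remove]
    set destination [open $tmpname w]
    puts -nonewline $destination [join $lines_after_deletion \n]
    close $destination
    file rename -force $tmpname $filename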

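Note, however, that this does not delete anything in place: it reads the entire file into memory, drops one element from the list of lines, and writes the entire file back out. Also note that [lreplace] counts from zero, so the first line is number 0.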

RS: Here is a case where awk is simpler than if we do it in Tcl. Say, you want all but the 4th line:
 gawk 'NR!=4' infile > outfile

sed is also a candidate:
 sed -n -e "4!p" infile > outfile

[Anyone know a more concise (-n-less?) way to achieve the same?]

AMG: Here ya go:
 sed 4d infile > outfile

You tried too hard. :^) This is even simpler than your awk version. Sed is a wonderful thing... too bad it's a write-only language. [1] [2]

To modify the original file, rather than creating a new file, do either of the following:
 sed 4d infile > tmpfile && mv tmpfile infile

 sed -i 4d infile

The latter only works with recent versions of GNU sed, but it can be quite handy, especially when dealing with multiple files. It internally uses temporary files, which is a good thing as I note below.

AMG: The last time someone asked this question in the Tcl'ers Chat, I wrote a sample program to do the job. I'd post it here but I didn't keep it... I guess I could write it a second time. Or I'll spell it out and let someone else translate to Tcl.

Anyway, my first thought was:

  1. [open] the file for reading and writing,
  2. Search for the line to delete using [gets] and [string equal], [regexp], etc.,
  3. Use [tell] before each [gets] to record the file position at the start of the matching line,
  4. [read] to the end of the file,
  5. [seek] back to the saved position,
  6. [puts -nonewline] the buffered data from the [read], and
  7. Truncate the file.

But at the time there was no way to truncate files, so I had to apologize for Tcl's inadequacy and change my approach:

  1. [open] the file for reading,
  2. Search for the line to delete,
  3. [append] each non-matching line to a buffer,
  4. Following a hit, [read] to the end of the file, and append this data to the buffer,
  5. [close] the file,
  6. Re-[open] the file for writing (using truncate mode "w"), and
  7. [puts -nonewline] the buffered data.

I don't like this version so much because it uses a larger buffer and overall it seems more fragile. But if multiple lines are to be deleted, or if more complicated transforms are called for, it may be more desirable.

But now there's a [chan truncate] command, so the first approach is viable! Yay.
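A minimal sketch of the first approach, assuming Tcl 8.6 (for try/finally; [chan truncate] itself needs 8.5) and plain ASCII text as discussed at the top of this page. The proc name delete_line_in_place and its arguments are hypothetical. It remembers where the matching line starts, because that is where the tail of the file has to be written back, and it uses binary translation so that the offsets reported by [chan tell] match the bytes actually read and written.

```tcl
# Delete the first line equal to $unwanted, rewriting the file in place.
# Returns 1 if a line was deleted, 0 if no line matched.
proc delete_line_in_place {filename unwanted} {
    set f [open $filename r+]
    try {
        # Binary translation: bytes pass through unchanged, lines end at \n.
        chan configure $f -translation binary
        while 1 {
            # Position of the line we are about to read.
            set start [chan tell $f]
            if {[chan gets $f line] < 0} {
                return 0   ;# end of file, nothing matched
            }
            if {$line eq $unwanted} {
                break
            }
        }
        # Everything after the matching line.
        set rest [chan read $f]
        # Overwrite the matching line with the tail, then cut off the leftovers.
        chan seek $f $start
        chan puts -nonewline $f $rest
        chan flush $f
        chan truncate $f [chan tell $f]
        return 1
    } finally {
        close $f
    }
}
```

Usage would look like: delete_line_in_place notes.txt "the exact line to drop". Swapping the eq test for [regexp] or [string match] gives the other search styles mentioned in step 2.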

IMPORTANT: Both approaches are fragile due to possible races with other processes accessing the same file. Between the start of step (6) and the end of step (7), the file's contents are "incorrect": with the first approach the file will contain junk at the end, and with the second it will be too short. With approach #1, if C is to be deleted from ABCDEFG, the file will momentarily contain ABDEFGG before being truncated to ABDEFG. With approach #2, the file will be empty, then contain A, AB, ABD, etc. before finally coming to rest at ABDEFG.

To fix this fragility, don't try to overwrite the file in place. Instead write to a temporary file. If all is well, atomically rename the new file on top of the original. This way the contents of the original file will be consistent at all times. The file's contents will atomically change from ABCDEFG to ABDEFG.

All of the above sed examples work this way, more or less.
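In Tcl, the temporary-file approach might be sketched as follows, assuming Tcl 8.6 for try/finally. The proc name delete_line_number and the temporary-name scheme ($filename.tmp.<pid>) are hypothetical choices for illustration. The temporary file is created next to the original so that the rename stays on one filesystem, which is generally what a rename needs in order to be atomic. Lines are numbered from 1 here, as in the sed examples.

```tcl
# Copy $filename to a temporary file, skipping line number $n (1-based),
# then rename the temporary file over the original.
proc delete_line_number {filename n} {
    # Hypothetical naming scheme; same directory as the original.
    set tmpname $filename.tmp.[pid]
    set in [open $filename r]
    set out [open $tmpname w]
    try {
        set lineno 0
        while {[gets $in line] >= 0} {
            if {[incr lineno] != $n} {
                puts $out $line
            }
        }
    } finally {
        # Runs on success and on error; on error the rename below is skipped
        # and the temporary file is left behind for inspection.
        close $in
        close $out
    }
    file rename -force -- $tmpname $filename
}
```

Because the file is streamed line by line, memory use stays small. One side effect: if the last line of the original had no trailing newline, the copy will have one.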

LV Note that the above does leave open the possibility, unfortunately, of the file being in an unexpected state.

For example, assume you have program a (simulating the line delete operation). While program a is reading the old file and writing the new file, program b opens the original file being processed and adds new data, etc. Without locking, the file after the line delete may, or may not, have the changes (it depends on how a and b do their work). If program a renames the original file, then program b fails, as may other programs that only need read access. If program a locks the file, program b needs to check that lock, or the same problem exists. In other words, a generic program to delete a line isn't really going to work out well. All the programs accessing the file have to be programmed to play by the same rules.

LV How does one atomically rename the new file on top of the original in Tcl?

JMN 'file rename -force source dest' should theoretically do this atomic renaming.

But be warned: on Windows, certain versions of Tcl produce duplicate inode values as shown by 'file stat', and this can result in file rename -force silently failing.

see tcl bug: 2015723

LV I suppose it is too much to expect that one could somehow make use of exclusive opens cross-platform, isn't it? The page, "How do I manage lock files in a cross platform manner in Tcl" certainly illustrates that, whatever else, the answer isn't simple.

[VBM] I came up with the following solution which might be considered "quick and dirty" but it works well if you know the content that you want to remove but not the line number.
  set input [open filename.dat]
  set output [open filename.tmp w]
  while {[gets $input line] >= 0} { if {[lsearch $line $to_remove] < 0} { puts $output $line } }
  close $input
  close $output
  file rename -force filename.tmp filename.dat

The only problem I can foresee is another process adding data to the original file after this one has started reading it but before it renames its temp file over the original; that data would be lost. In the situation I'm using this for it is extremely unlikely that this will ever happen... but your situation may vary.

WARNING! WARNING! The lsearch line above is a mis-use. I *think* VBM intended something more like
    ... if {[string first $to_remove $line] < 0} ...

LV I agree with whoever wrote the warning. The reason has to do with strings versus lists. When one does a [gets], one gets a string as a result. [lsearch] first has to parse that string as a list, and then it compares the pattern against whole list elements rather than looking for a substring. While some applications might have an external requirement that the input file be a properly quoted Tcl list, unless the data can be guaranteed to be a list, using [string first] is safer.
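A small illustration of the difference, using made-up sample lines. The helper name line_contains is hypothetical. A line with an unbalanced brace is a perfectly good string but not a valid list, so [lsearch] raises an error on it, and even on a valid list [lsearch] only matches whole elements.

```tcl
# Substring test that treats the line as a plain string.
proc line_contains {line needle} {
    expr {[string first $needle $line] >= 0}
}

# An unbalanced brace makes this string an invalid Tcl list.
set line "error in \{section 3"

# lsearch must parse the string as a list first, so it throws here.
set lsearch_failed [catch {lsearch $line error}]   ;# 1

# string first does not care about list syntax.
set found [line_contains $line error]              ;# 1

# On a valid list, lsearch matches whole elements, not substrings.
set element_index [lsearch {foo bar} fo]           ;# -1
set substring_found [line_contains "foo bar" fo]   ;# 1
```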

[BRH]: AMG discussed the "fragility" of some of these techniques, and said, "All of the above sed examples work this way, more or less." The issue with commands at a UNIX command prompt is somewhat more subtle, and because it is not well understood is worthy of comment here.

As was mentioned, some versions of sed have an "edit in place" feature so a command like
 sed -i 4d infile

works as expected. In the absence of this feature, as was correctly mentioned above, you could write
 sed 4d infile > tmpfile && mv tmpfile infile

However, a novice might be tempted to try and take a shortcut, and write
 sed 4d infile > infile

to edit infile "in place." This is disastrous, not because of a potential race condition while the file is being edited in place, but because of the way UNIX command shells operate.

In general, a command at the command prompt is processed first by the command shell, and then the actual command gets to do its thing. In the example here,
 sed 4d infile > infile

the command shell sees the ">" output redirection character BEFORE sed starts to read and edit the file. The ">" redirection character tells the command shell to TRUNCATE THE TARGET FILE to zero length (if it exists), and open the now-EMPTY FILE for writing. This all happens BEFORE sed gets a chance to work, so by the time sed finally is allowed to read the file, its contents have been lost. This is natural and normal, and all commands behave this way in a UNIX world.

AMG: Good point, thanks. I think the problem is that people read and write the command line left-to-right and therefore have some kind of unconscious expectation that it will execute left-to-right. In your example, it's reasonable to expect that sed reads the input before writing the output. However, even though this expectation is actually correct, the trouble is that the O_TRUNC file open happens first, despite the fact that the notation comes at the end of the line. If the shell syntax were to put the file redirections at the left side of the line (similar to how it does environment variable overrides),
 infile < sed 4d infile

it would be more obvious that they take place before the child process (sed) is executed.
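The same ordering can be reproduced from Tcl, as a rough analogy rather than a description of what any particular shell does internally. The proc name read_after_truncating_open is hypothetical. Opening a file with mode "w" truncates it at once, so a reader that opens the file afterwards finds nothing, just as sed does after the shell has set up the redirection.

```tcl
# Open $path for writing first (as the shell does for ">"), then read it
# (as sed would), and return what the reader saw.
proc read_after_truncating_open {path} {
    set out [open $path w]   ;# the truncation happens here, before any read
    set in [open $path r]
    set seen [read $in]      ;# the original contents are already gone
    close $in
    close $out
    return $seen
}
```

Calling this on a file that holds data returns an empty string and leaves the file empty, which is also why step 6 of the second approach above opens the file for writing only after everything has been read into the buffer.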