This follows the cut_bytes() approach of letting read_line() create a
buffer and find the newline. read_line() guarantees that our buffer is a
valid UTF-8 string.
When writing out the byte segment we must make sure we cut on UTF-8
boundaries; therefore we iterate over the buffer returned by
read_line(). This implementation should be efficient, as it iterates
over the buffer only once.
The previous implementation was about 4x slower than cut_bytes(); this
one is about 2x slower.
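The single-pass idea above can be sketched as follows. This is not the actual uutils code, just a minimal illustration: `cut_chars` is a hypothetical helper that walks the buffer's char_indices() once to translate a character range into byte offsets that are guaranteed to fall on UTF-8 boundaries.

```rust
// Sketch: cut a character range [start, end) out of a line while
// respecting UTF-8 boundaries, iterating over the buffer only once.
fn cut_chars(line: &str, start: usize, end: usize) -> &str {
    let mut byte_start = line.len();
    let mut byte_end = line.len();
    for (count, (idx, _)) in line.char_indices().enumerate() {
        if count == start {
            byte_start = idx;
        }
        if count == end {
            byte_end = idx;
            break;
        }
    }
    &line[byte_start..byte_end]
}

fn main() {
    // 'é' is two bytes, so naive byte slicing could panic mid-character.
    let line = "héllo wörld";
    println!("{}", cut_chars(line, 1, 4)); // éll
}
```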
No longer iterate over each byte; instead rely on the Buffer trait
to find the newline for us. Iterate over the ranges to select slices of
the line that need to be printed out.
This rewrite gives a significant performance increase:
Old: 1.32s
mahkoh: 0.90s
New: 0.20s
GNU: 0.15s
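A minimal sketch of this approach, under the assumption that the ranges have already been resolved to byte offsets: the standard BufRead::read_until locates the newline for us, and each range is then written out as a plain slice of the line, with no per-byte loop. The `cut_line` helper and the hard-coded ranges are illustrative, not the real uutils code.

```rust
use std::io::{self, BufRead, Write};

// Write each requested (start, end) byte range of a line as a slice,
// followed by a newline. Ranges past the end of the line are clamped.
fn cut_line(line: &[u8], ranges: &[(usize, usize)], out: &mut impl Write) -> io::Result<()> {
    for &(start, end) in ranges {
        let end = end.min(line.len());
        if start < end {
            out.write_all(&line[start..end])?;
        }
    }
    out.write_all(b"\n")
}

fn main() -> io::Result<()> {
    let input = b"abcdefgh\nijklmnop\n" as &[u8];
    let mut reader = io::BufReader::new(input);
    let mut line = Vec::new();
    let stdout = io::stdout();
    let mut out = stdout.lock();
    // read_until finds the newline; we never scan bytes ourselves.
    while reader.read_until(b'\n', &mut line)? > 0 {
        if line.last() == Some(&b'\n') {
            line.pop();
        }
        cut_line(&line, &[(0, 3), (5, 8)], &mut out)?;
        line.clear();
    }
    Ok(())
}
```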
This implementation uses Rust's concept of characters and fails if the
input isn't valid UTF-8. GNU cut implements '--characters' as an alias
for '--bytes' and thus has different semantics for this option than
this implementation.
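The semantic difference can be shown in a few lines. The `first_n_chars` helper below is hypothetical: because Rust's char iteration only exists on valid UTF-8, invalid input must be rejected up front, whereas GNU cut's byte-alias semantics would happily slice it.

```rust
// Take the first n *characters* of the input, failing on invalid UTF-8.
fn first_n_chars(input: &[u8], n: usize) -> Result<String, std::str::Utf8Error> {
    let s = std::str::from_utf8(input)?; // error on invalid UTF-8
    Ok(s.chars().take(n).collect())
}

fn main() {
    // 'é' is two bytes: cutting 2 characters is not cutting 2 bytes.
    assert_eq!(first_n_chars("éa".as_bytes(), 2).unwrap(), "éa");
    // An invalid UTF-8 sequence is an error, unlike GNU's '--bytes' alias.
    assert!(first_n_chars(&[0xFF, 0x61], 2).is_err());
    println!("ok");
}
```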
The following changes fix #303:
1. hashsum pulls 512KB chunks of the file into memory. This ends up taking 1MB with
a secondary buffer allocated for Windows. hashsum is now able to hash files larger
than the computer's available memory.
2. Text is no longer transformed to UTF-8. This allows hashing to work on binary files
without specifying text mode. On Windows, a Windows newline '\r\n' is converted to
the standard newline '\n'.
3. Set default modes: Windows uses binary by default, all other systems use text.
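The chunked approach in point 1 can be sketched as below. Note the stand-in: std's DefaultHasher replaces the real digests (md5, sha*) that hashsum uses, since those live in external crates; the 512 KiB buffer size matches the commit, and `hash_reader` is a hypothetical helper.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read};

// Read 512 KiB at a time and feed each chunk to the hasher, so the whole
// file never has to fit in memory.
fn hash_reader(mut reader: impl Read) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    let mut buf = vec![0u8; 512 * 1024];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of file
        }
        hasher.write(&buf[..n]);
    }
    Ok(hasher.finish())
}

fn main() -> io::Result<()> {
    let data = vec![0u8; 2 * 1024 * 1024]; // 2 MiB, read in four chunks
    let digest = hash_reader(&data[..])?;
    println!("{:016x}", digest);
    Ok(())
}
```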
Gil Cottle <gcottle@redtown.org>
* Changed line verification to use regular expressions.
* Added a binary marker to the output and started using the marker from
  the check-file line as input to calc_sum
* Convert characters to lowercase before comparison in check
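The check-file parsing described above can be sketched like this. The commit uses regular expressions; to keep this self-contained the sketch uses plain string operations instead, and the `parse_check_line` helper and the exact line layout ("<hex digest> <marker><filename>", with '*' marking binary mode) are assumptions based on the common md5sum-style format.

```rust
// Parse one check-file line into (lowercased digest, binary?, filename).
// Returns None if the line does not contain a space-separated digest.
fn parse_check_line(line: &str) -> Option<(String, bool, &str)> {
    let (digest, rest) = line.split_at(line.find(' ')?);
    let rest = &rest[1..]; // drop the separating space
    let (binary, name) = match rest.as_bytes().first()? {
        b'*' => (true, &rest[1..]),  // '*' marker: binary mode
        b' ' => (false, &rest[1..]), // second space: text mode
        _ => (false, rest),
    };
    // Lowercase the digest before comparison, as check mode now does.
    Some((digest.to_lowercase(), binary, name))
}

fn main() {
    let (digest, binary, name) =
        parse_check_line("D41D8CD98F00B204E9800998ECF8427E *empty.bin").unwrap();
    assert_eq!(digest, "d41d8cd98f00b204e9800998ecf8427e");
    assert!(binary);
    assert_eq!(name, "empty.bin");
    println!("ok");
}
```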
Gil Cottle <gcottle@redtown.org>