FileCleaver

Cleave (split) files (and streams, etc.) into chunks.

Installing

Should work with Python 2 and 3.

pip install filecleaver

You may wish to use easy_install instead of pip, and/or use a virtualenv.

What is this?

Grep, awk, split... MapReduce?

The motivation for this module came out of a job interview question that dug deep in my mind like a splinter. First, I was asked how I would retrieve certain information from a certain kind of text file format. Easy enough; I am quite adept at shell scripting. Some grep and awk and it’s no problem.

Next, I was asked: what if the file was very large, then what? Initially I gave a one-word response, “MapReduce”, and added, “you know, multiple threads or processes or whatever.” No problem!

But there is kind of a problem. Splitting a file on disk to feed into a MapReduce style of processing can be nearly as challenging and resource-consuming as the original problem. There isn’t a UNIX utility that will efficiently split a file for you on line boundaries that other utilities can then read and make sense of (and I love shell scripting, which is why I was so interested in this). If you look at split(1), you’ll see options for splitting a file by a number of lines, but it starts at the beginning of the file, reads that many lines, writes a chunk, and repeats. You’d be better off doing your processing while that line-by-line read was happening.

So, what we need is a fast way to take a file off disk, chunk it up, and send the bits to different processes. The split doesn’t need to be exact, just close enough that all processes are doing similar amounts of meaningful work.

Seeking and finding

The fastest way to move through files is to forget about line boundaries and just move the file pointer any number of bytes forward. If you then just read to the end of the line that you land in the middle of, you’ve found the nearest next line ending.
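
To make this concrete, here is a minimal sketch of the strategy using only Python’s built-in file API (an illustration of the idea, not filecleaver’s actual implementation): seek ahead a fixed number of bytes, call readline() to advance to the next newline, and record that offset as a chunk boundary.

import os

def rough_boundaries(filename, chunk_size):
    # Collect byte offsets that fall on line endings, roughly chunk_size apart.
    boundaries = [0]
    size = os.path.getsize(filename)
    with open(filename, 'rb') as f:
        while boundaries[-1] < size:
            f.seek(min(boundaries[-1] + chunk_size, size))
            f.readline()                # finish the line we landed in the middle of
            boundaries.append(f.tell())
    return boundaries

Each consecutive pair of offsets then describes one line-aligned chunk that could be handed to a separate process.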

This is nothing special. I am betting that this strategy has been repeated countless times in every existing programming language that has touched parallel storage. I am just interested in seeing an actual, working implementation in Python to finish answering the interview question thoroughly (even though I got the job already).

This is what filecleaver does. You can ask for N chunks, or for chunks of roughly M bytes. Filecleaver then seeks through your file (or stream or whatever) and hands you back a list of easy-to-use file-like objects called FileChunks. Many utilities do this kind of work, including split(1); what makes filecleaver special is that it quickly gives you back chunks that fall on line boundaries and are approximately the size you requested.
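
In code, those two entry points look like this (a quick sketch; the 'words' filename and the sizes are placeholders, and both calls appear in full in the examples below):

from filecleaver import cleave, cleavebytes

readers = cleave('words', 10)            # split into roughly 10 chunks
readers = cleavebytes('words', 2**20)    # or into chunks of roughly 1 MiB each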

Whether this is actually any faster than just reading through the file once in a single process probably depends on your storage hardware.

Examples

Split a file into 10 chunks:

from __future__ import print_function
from filecleaver import cleave

filename = 'words'
readers = cleave(filename, 10)   # ten roughly equal, line-aligned chunks

for i, reader in enumerate(readers):
    with reader.open() as src, open('out{:02}'.format(i), 'wb') as dst:
        print('Chunk #{} was {} bytes'.format(i, src.end - src.start))
        dst.write(src.read())

Or you can use readline or readlines, or treat the FileChunk object as an iterator:

for i, reader in enumerate(readers):
    with reader.open() as src, open('out{:02}'.format(i), 'wb') as dst:
        for line in src:
            dst.write(line)

for i, reader in enumerate(readers):
    with reader.open() as src, open('out{:02}'.format(i), 'wb') as dst:
        dst.writelines(src.readlines())

Split a file into chunks of 109088170 bytes or larger (each chunk extended up to the next line ending):

from __future__ import print_function
from filecleaver import cleavebytes

filename = 'words'
readers = cleavebytes(filename, 109088170)

for i, reader in enumerate(readers):
    with reader.open() as src, open('out{:02}'.format(i), 'wb') as dst:
        print('Chunk #{} was {} bytes'.format(i, src.end - src.start))
        dst.write(src.read())

Split a file into chunks and then process each chunk with multiprocessing:

from __future__ import print_function

import multiprocessing

from filecleaver import cleavebytes

def writechunk(i, reader):
    with reader.open() as src, open('out{}'.format(i), 'wb') as dst:
        dst.write(src.read())
        return 'Chunk #{} was {} bytes'.format(i, src.end - src.start)

def writechunk_cb(result):
    print(result)

# Guard the pool setup so spawned worker processes don't re-run it
# (needed on platforms that spawn rather than fork, e.g. Windows).
if __name__ == '__main__':
    filename = 'words'
    readers = cleavebytes(filename, 109088170)
    tasks = []
    pool = multiprocessing.Pool()

    for i, reader in enumerate(readers):
        tasks.append((i, reader))

    for task in tasks:
        pool.apply_async(writechunk, args=task, callback=writechunk_cb)

    pool.close()
    pool.join()

Using .read() is definitely the fastest way to go if you’re just reading chunks and not processing each line. So one alternative use of filecleaver is simply to iterate over file chunks that are close to the size you want to consume in each iteration. You can, of course, do this with the normal file API, seek, and read all by yourself, but maybe you’ll like using this class since it makes things easy.
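
For example, a minimal sketch of that usage might look like the following (the 64 MiB chunk size and the newline count are arbitrary; cleavebytes, open(), read(), start, and end are used exactly as in the examples above):

from __future__ import print_function
from filecleaver import cleavebytes

# Consume a large file in roughly 64 MiB, line-aligned pieces.
for reader in cleavebytes('words', 64 * 1024 * 1024):
    with reader.open() as src:
        data = src.read()               # one whole chunk in memory
        print(src.end - src.start, data.count(b'\n'))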
