Tuesday, 28 January 2020

Python: read and write a large file in chunks

When uploading a file from a PC to a server, or downloading one from a server to a PC, loading a large file into memory in one piece may exceed the available memory. The alternative is to break the large file into small chunks: load one chunk into memory, store it on the server, release the memory, and load the next chunk. To retrieve the file from the server, load one chunk at a time into memory and append it to the file on the local PC. When all chunks have been appended, the complete file is on disk.
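The load-store-release loop described above can be sketched as a plain chunked copy between two local files. This is a minimal illustration rather than the code used in this post: the name copy_in_chunks and its parameters are hypothetical, and it works in binary mode, so chunk_size is measured in bytes.

```python
def copy_in_chunks(src_path, dst_path, chunk_size=1024):
    """Copy src_path to dst_path, holding at most chunk_size bytes in memory.

    Hypothetical helper for illustration; returns the number of chunks copied.
    """
    chunks_copied = 0
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            data = src.read(chunk_size)
            if not data:
                # an empty bytes object means end of file
                break
            dst.write(data)
            chunks_copied += 1
    return chunks_copied
```

Because each chunk is written before the next one is read, memory use stays near chunk_size regardless of the size of the file.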
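The example below reads a.txt, breaks it into 30 chunks of 1 KB each (1024 characters, since the file is opened in text mode), pickles them, and stores them under ordered names in the pickled_chunks folder. It then loads the pickled chunks in file-name order and writes them to b.txt one by one.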

30 * 1kb chunks = 30kb
a.txt

b.txt
Unrecognized symbols from a.txt are replaced with ? in b.txt.
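The ? characters come from the errors='replace' option passed to open. The sketch below shows the same behavior on an in-memory byte string. The sample bytes and the UTF-8 and ASCII encodings are assumptions chosen for illustration; the encodings actually used for a.txt and b.txt depend on the platform default.

```python
# 0xE9 on its own is not a valid UTF-8 sequence
raw = b'caf\xe9 ok'

# decoding with errors='replace' substitutes U+FFFD for the bad byte
decoded = raw.decode('utf-8', errors='replace')
print(repr(decoded))   # 'caf\ufffd ok'

# encoding with errors='replace' substitutes ? for characters
# the target encoding cannot represent
encoded = decoded.encode('ascii', errors='replace')
print(encoded)         # b'caf? ok'
```

Replacement is lossy: the original byte cannot be recovered from b.txt, so this approach suits text where an occasional ? is acceptable.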

import pickle
import glob
import os.path

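def read_in_chunks(file_object, chunk_size=1024):
    # generator: yields one piece at a time instead of reading the whole file
    # (chunk_size counts characters in text mode, bytes in binary mode)
    while True:
        data = file_object.read(chunk_size)
        if not data:
            # an empty result means end of file
            break
        yield data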


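def store_chunks():
    # create the output folder if it does not exist yet
    os.makedirs('pickled_chunks', exist_ok=True)

    with open('a.txt', 'r', errors='replace') as file:
        for i, piece in enumerate(read_in_chunks(file)):
            # name each chunk by its index: 0.pickle, 1.pickle, ...
            chunkName = str(i) + '.pickle'
            with open(os.path.join('pickled_chunks', chunkName), 'wb') as chunk:
                pickle.dump(piece, chunk)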


store_chunks()  # run once to create the chunks, then it can be commented out

def assemble_chunks():
    # glob is used only to count the chunks
    chunks = glob.glob(os.path.join('pickled_chunks', '*.pickle'))

    with open('b.txt', 'w', errors='replace') as file:
        for i in range(len(chunks)):
            # rebuild each name from its index so chunks are read in numeric order
            chunkName = str(i) + '.pickle'

            with open(os.path.join('pickled_chunks', chunkName), 'rb') as pickled_chunk:
                file_chunk = pickle.load(pickled_chunk)
                file.write(file_chunk)


assemble_chunks()
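Building each name from str(i) matters because a plain string sort of the chunk names puts 10.pickle before 2.pickle. If the names returned by glob were used directly, one option would be to sort them by their numeric part. This is a small sketch of that idea: chunk_index is a hypothetical helper, and the names list is made-up sample data.

```python
import os

def chunk_index(path):
    # hypothetical helper: 'pickled_chunks/12.pickle' -> 12
    return int(os.path.splitext(os.path.basename(path))[0])

names = ['10.pickle', '2.pickle', '0.pickle', '1.pickle']

print(sorted(names))
# ['0.pickle', '1.pickle', '10.pickle', '2.pickle']   (string order)

print(sorted(names, key=chunk_index))
# ['0.pickle', '1.pickle', '2.pickle', '10.pickle']   (numeric order)
```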

reference:
http://chuanshuoge2.blogspot.com/2020/01/python-yield.html

glob
https://www.geeksforgeeks.org/how-to-use-glob-function-to-find-files-recursively-in-python/
