Reading Very Large Files in Python (GB-Scale)

Recently, while processing a text file of about 2 GB, I ran into a MemoryError and very slow reads. I later found two faster ways to read large files, and this post describes both.

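When we talk about "text processing", we usually mean processing the content of a file. Python makes it easy to read a text file's content into a string variable you can work with. File objects provide three "read" methods: .read(), .readline(), and .readlines(). Each accepts an optional argument that limits how much data is read per call, but they are usually called without one. .read() reads the entire file at once and is typically used to put the file's content into a single string variable. That gives the most direct string representation of the content, but it is unnecessary for sequential, line-oriented processing, and it is not possible at all when the file is larger than available memory. Here is an example of read():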

```python
f = None
try:
    f = open('/path/to/file', 'r')
    print(f.read())  # reads the whole file into memory at once
finally:
    # f is still None if open() failed, so only close a file that was opened
    if f:
        f.close()
```
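To make the difference between the three read methods concrete, here is a minimal sketch. It uses an in-memory `io.StringIO` object as a stand-in for a small text file; the sample content and variable names are made up for illustration and do not come from the original post.

```python
import io

# In-memory stand-in for a small text file (sample content is made up)
sample = io.StringIO("first line\nsecond line\nthird line\n")

whole = sample.read()        # entire content as one string
sample.seek(0)               # rewind to the start

first = sample.readline()    # one line, trailing newline included
rest = sample.readlines()    # all remaining lines as a list of strings

print(repr(whole))
print(repr(first))
print(rest)
```

Real file objects returned by open() in text mode behave the same way for these three calls.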

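Calling read() loads the entire file at once, so with a 10 GB file you will run out of memory. To be safe, call read(size) repeatedly instead; each call returns at most size characters (bytes in binary mode). You can also call readline() to read one line at a time, or readlines() to read everything at once and get it back as a list of lines. Choose whichever fits the situation.

If the file is small, a single read() is the most convenient. If you cannot be sure of the file size, calling read(size) repeatedly is safer. For a configuration file, readlines() is the most convenient: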

```python
for line in f.readlines():
    process(line)  # <do something with line>
```
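The repeated read(size) approach mentioned above can be sketched as follows. The helper name `count_chars`, the block size, and the temporary test file are all assumptions made for illustration, and counting characters simply stands in for whatever real processing you need.

```python
import os
import tempfile


def count_chars(path, size=64 * 1024):
    """Count characters by calling read(size) until it returns ''."""
    total = 0
    with open(path, 'r') as f:
        while True:
            block = f.read(size)   # at most `size` characters per call
            if not block:          # an empty string means end of file
                break
            total += len(block)
    return total


# Small made-up file so the sketch can run on its own
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('abc\n' * 1000)

total = count_chars(tmp.name, size=256)
print(total)
os.remove(tmp.name)
```

Only one block of at most `size` characters is held in memory at any time, regardless of how large the file is.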

Read In Chunks

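An obvious way to handle a large file is to split it into smaller pieces, process them one at a time, and release the memory used by each piece once it is done. The code here uses iteration and yield: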

```python
def read_in_chunks(filePath, chunk_size=1024*1024):
    """
    Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1M
    You can set your own chunk size
    """
    # The with block closes the file once the generator finishes
    with open(filePath) as file_object:
        while True:
            chunk_data = file_object.read(chunk_size)
            if not chunk_data:
                break
            yield chunk_data


if __name__ == "__main__":
    filePath = './path/filename'
    for chunk in read_in_chunks(filePath):
        process(chunk)  # <do something with chunk>
```
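The text mentions iter alongside yield, so here is a hedged variant that uses the two-argument form of the built-in iter(callable, sentinel) instead of an explicit while loop. The function name `read_in_chunks_iter`, the binary mode, and the temporary test file are assumptions for illustration, not the original author's code.

```python
import os
import tempfile
from functools import partial


def read_in_chunks_iter(file_path, chunk_size=1024 * 1024):
    """Yield binary chunks using iter(callable, sentinel)."""
    with open(file_path, 'rb') as f:
        # iter() keeps calling f.read(chunk_size) until it returns b''
        for chunk in iter(partial(f.read, chunk_size), b''):
            yield chunk


# Small made-up file so the sketch can run on its own
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'x' * 2500)

sizes = [len(c) for c in read_in_chunks_iter(tmp.name, chunk_size=1000)]
print(sizes)
os.remove(tmp.name)
```

Note that fixed-size chunks can split a line in the middle, so chunked reading suits processing that does not depend on line boundaries.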

Using with open()

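The with statement opens and closes the file for you, even when an exception is raised inside the block. Writing for line in f treats the file object f as an iterator, which uses buffered I/O and memory management automatically, so you do not have to worry about large files. Only one line is held in memory at a time, so this works well as long as individual lines are of reasonable length.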

```python
# If the file is line based
with open(...) as f:
    for line in f:
        process(line)  # <do something with line>
```
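As a self-contained illustration of the pattern above, the sketch below counts the lines that contain a given substring. The helper `count_matching_lines`, the search term, and the temporary file contents are hypothetical and exist only so the example can run on its own.

```python
import os
import tempfile


def count_matching_lines(path, needle):
    """Count lines containing `needle`, reading one line at a time."""
    count = 0
    with open(path, 'r') as f:
        for line in f:   # the file object yields one line per iteration
            if needle in line:
                count += 1
    return count


# Small made-up file so the sketch can run on its own
with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    tmp.write('ok\nERROR: disk\nok\nERROR: net\n')

matches = count_matching_lines(tmp.name, 'ERROR')
print(matches)
os.remove(tmp.name)
```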

Conclusion

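When reading large files in Python, let the system do the work: use the simplest approach, leave the details to the interpreter, and focus on your own processing logic.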

WeChat official account: zeropython; QQ: 5868037; QQ email: 5868037@qq
