NumPy array append slow

For a university project I need to implement an inverted index in Python. This consists of a dictionary containing NumPy arrays, to which I append new values as I process more text. We've been told to use NumPy instead of regular Python lists to decrease memory usage, since NumPy allows smaller, explicitly defined data types. However, appending to an array and then swapping the value in the dictionary proves to be terribly slow. Is there a better way to implement this?

Comments

This problem arises frequently for people who have no experience with the C or C++ programming languages. As most of NumPy is written in C, it has some drawbacks that stem from that. But it is good that you are spreading this knowledge, as not everyone knows this.
Ondrej Brichta - Mon, 12/20/2021 - 10:53 :::
I had the same issue while working with Python; I switched to pandas instead of NumPy.
Moritz Ruoff Holzer - Wed, 12/22/2021 - 06:27 :::
1 answer

The problem you're experiencing comes from the fact that numpy.append does not grow an array in place: every call allocates a completely new array and copies the existing data into it. (NumPy arrays have no append method of their own.) Called often and on large arrays, this becomes very expensive and should definitely be avoided. There are two ways to solve this issue:

1. If you don't know how big the arrays are going to be, you can use Python lists during processing and convert them to NumPy arrays afterwards. Python lists grow in place, so appending to them is much faster. The downside is higher memory usage during processing, as you can't use NumPy's more compact data types until the conversion.

2. If you know how big the arrays are going to be, you can preallocate the whole array with NumPy's zeros function and then fill in the values as you progress. This may require changing your processing algorithm and could therefore be slower, but it can be far more memory efficient.
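
The two options above can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the function names, the sample documents, the whitespace tokenization, and the uint32 dtype for document IDs are all assumptions made for the example.

```python
import numpy as np


def build_index_with_lists(docs):
    """Option 1: collect postings in Python lists, convert once at the end."""
    postings = {}
    for doc_id, text in enumerate(docs):
        # set() so each document is recorded at most once per term
        for term in set(text.split()):
            postings.setdefault(term, []).append(doc_id)
    # One conversion per term; uint32 is an assumed dtype for document IDs.
    return {term: np.array(ids, dtype=np.uint32) for term, ids in postings.items()}


def fill_preallocated(values, size):
    """Option 2: allocate once with np.zeros, then write by position."""
    out = np.zeros(size, dtype=np.uint32)
    count = 0
    for value in values:
        out[count] = value  # raises IndexError if more than `size` values arrive
        count += 1
    return out[:count]  # view onto the filled part only


# Hypothetical sample data
docs = ["the cat sat", "the dog sat", "a cat ran"]
index = build_index_with_lists(docs)
filled = fill_preallocated([5, 6, 7], size=10)

# For comparison: np.append returns a new array and leaves the original untouched.
a = np.array([1, 2, 3])
b = np.append(a, 4)
```

With option 1 the conversion cost is paid once per term instead of once per appended value. With option 2 the size has to be known (or safely overestimated) up front, which is why it may require restructuring the processing loop.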

Comments

Thank you for this explanation! I had the same problem in a course this autumn and your explanation helped a lot.

Mathilda Moström - Thu, 12/16/2021 - 10:22 :::

Thanks for bringing this up. As you explained, appending fills the existing list in place and no new lists are generated, which is why lists can be faster. However, as you said, depending on the data size NumPy arrays can sometimes be faster too.

Parinaz Momeni ... - Wed, 12/22/2021 - 00:04 :::