Create a NumPy array from a text file

1. Intro

NumPy provides functions for creating an array directly from text files such as CSV and TSV. Real-world data often lives in files on disk, so these functions save development and analysis time.

numpy.loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes', max_rows=None)

The numpy.loadtxt() function is an efficient way to load data from text files in which every row has the same number of values.

Python 3.6.5 and NumPy 1.15 are used. Visual Studio Code 1.30.2 is used to run the interactive IPython code cells.
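To make a few of the parameters in the signature above concrete, here is a minimal sketch. It reads from an in-memory io.StringIO buffer rather than a file on disk, which is an assumption made only to keep the example self-contained, and the sample values are made up for illustration.

```python
import io

import numpy as np

# Hypothetical sample data: whitespace-separated values with a comment line.
text = io.StringIO(
    "# day-1 day-2 day-3\n"
    "1 2 3\n"
    "4 5 6\n"
)

# delimiter=None (the default) splits each row on whitespace,
# comments="#" (the default) ignores the first line,
# dtype=int replaces the default float type.
sample = np.loadtxt(text, dtype=int)

print("NumPy array: \n", sample)
print("Shape: ", sample.shape)
```

The same call should work with a file name in place of the buffer.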

2. NumPy array from CSV file

We have a CSV file with Delhi rainfall data in millimeters for every month of the years 2017 and 2018.

CSV file

12.0, 12.0, 14.0, 16.0, 19.0, 12.0, 11.0, 14.0, 17.0, 19.0, 11.0, 11.5
13.0, 11.0, 13.5, 16.7, 15.0, 11.0, 12.0, 11.0, 19.0, 18.0, 13.0, 12.5

We will create a NumPy array from this CSV file using the numpy.loadtxt() function. It accepts a delimiter argument, so the same call can handle files that separate values with other characters.

#%%
import numpy as np

# Create an array from rain-fall.csv, keeping rainfall data in mm

array_rain_fall = np.loadtxt(fname="rain-fall.csv", delimiter=",")

print("NumPy array: \n", array_rain_fall)
print("Shape: ", array_rain_fall.shape)
print("Data Type: ", array_rain_fall.dtype.name)

OUTPUT:

NumPy array: 
 [[12.  12.  14.  16.  19.  12.  11.  14.  17.  19.  11.  11.5]
 [13.  11.  13.5 16.7 15.  11.  12.  11.  19.  18.  13.  12.5]]
Shape:  (2, 12)
Data Type:  float64
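As a sketch of how the delimiter argument adapts to other separators, the example below loads semicolon-separated values. The data is a hypothetical in-memory excerpt (the first four months only) held in io.StringIO so that the snippet runs without any file on disk.

```python
import io

import numpy as np

# Hypothetical semicolon-separated excerpt of the rainfall data.
semicolon_text = io.StringIO(
    "12.0;12.0;14.0;16.0\n"
    "13.0;11.0;13.5;16.7\n"
)

# Only the delimiter changes compared with the CSV call.
array_semicolon = np.loadtxt(semicolon_text, delimiter=";")

print("Shape: ", array_semicolon.shape)
print("Data Type: ", array_semicolon.dtype.name)
```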

2.1 Error when rows have different column counts

When creating a NumPy array with numpy.loadtxt(), make sure every row of the CSV file has the same number of columns. Otherwise the call fails with an error.

Here we call numpy.loadtxt() on rain-fall-wrong.csv, a file whose rows differ in column count.

#%%
# Check error when different column counts in rows

array_rain_fall_wrong = np.loadtxt(
    fname="rain-fall-wrong.csv", delimiter=","
)

OUTPUT:

ValueError: Wrong number of columns at line 2
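If a script should keep running when a file turns out to be malformed, the error can be caught. The following is a minimal sketch, assuming a hypothetical ragged data set held in memory. The wording of the error message can differ between NumPy versions, so the code relies only on the exception type.

```python
import io

import numpy as np

# Hypothetical ragged data: the second row has one value fewer than the first.
ragged_text = io.StringIO(
    "12.0,12.0,14.0\n"
    "13.0,11.0\n"
)

try:
    array_ragged = np.loadtxt(ragged_text, delimiter=",")
except ValueError as error:
    # Fall back to None and report the problem instead of stopping the script.
    array_ragged = None
    print("Could not load data:", error)
```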

2.2 Skipping rows and columns in CSV

We can skip rows and columns while creating a NumPy array from a CSV file. This is useful when the file contains row and column names.

To do so, we pass the skiprows and usecols arguments to loadtxt().

rain-fall-row-col-names.csv file:

Year, Jan, Feb, Mar, Apr, May, Jun, July, Aug, Sep, Oct, Nov, Dec
2017, 12.0, 12.0, 14.0, 16.0, 19.0, 12.0, 11.0, 14.0, 17.0, 19.0, 11.0, 11.5
2018, 13.0, 11.0, 13.5, 16.7, 15.0, 11.0, 12.0, 11.0, 19.0, 18.0, 13.0, 12.5

#%%
# Skip first row and first column

array_rain_fall_named = np.loadtxt(
    fname="rain-fall-row-col-names.csv",
    delimiter=",",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_named)
print("Shape: ", array_rain_fall_named.shape)
print("Data Type: ", array_rain_fall_named.dtype.name)

OUTPUT:

NumPy array: 
 [[12.  12.  14.  16.  19.  12.  11.  14.  17.  19.  11.  11.5]
 [13.  11.  13.5 16.7 15.  11.  12.  11.  19.  18.  13.  12.5]]
Shape:  (2, 12)
Data Type:  float64
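The column tuple can be written more compactly. The sketch below assumes an in-memory copy of the file (written without spaces after the commas) and uses range(1, 13) for usecols, which selects the same twelve columns. It also reads the skipped first column separately as integer year labels. The names named_text, rainfall and years are introduced here for illustration only.

```python
import io

import numpy as np

# In-memory stand-in for rain-fall-row-col-names.csv (illustrative copy).
named_text = (
    "Year,Jan,Feb,Mar,Apr,May,Jun,July,Aug,Sep,Oct,Nov,Dec\n"
    "2017,12.0,12.0,14.0,16.0,19.0,12.0,11.0,14.0,17.0,19.0,11.0,11.5\n"
    "2018,13.0,11.0,13.5,16.7,15.0,11.0,12.0,11.0,19.0,18.0,13.0,12.5\n"
)

# range(1, 13) covers columns 1 to 12, like the explicit tuple.
rainfall = np.loadtxt(
    io.StringIO(named_text), delimiter=",", skiprows=1, usecols=range(1, 13)
)

# Read only column 0 (the year) as integers.
years = np.loadtxt(
    io.StringIO(named_text), delimiter=",", skiprows=1, usecols=0, dtype=int
)

print("Shape: ", rainfall.shape)
print("Years: ", years)
```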

2.3 Create NumPy array with GZipped file

Gzip is helpful for reducing the size of files, especially text files. When the file name ends in .gz, numpy.loadtxt() decompresses the file first and then processes it as usual.

This works for delimited text files with any delimiter.

#%%
# Create array from gzipped csv

array_rain_fall_zip = np.loadtxt(
    fname="rain-fall-row-col-names.csv.gz",
    delimiter=",",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_zip)
print("Shape: ", array_rain_fall_zip.shape)
print("Data Type: ", array_rain_fall_zip.dtype.name)

OUTPUT:

NumPy array: 
 [[12.  12.  14.  16.  19.  12.  11.  14.  17.  19.  11.  11.5]
 [13.  11.  13.5 16.7 15.  11.  12.  11.  19.  18.  13.  12.5]]
Shape:  (2, 12)
Data Type:  float64
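To try this without an existing archive, the sketch below writes a small gzipped CSV into a temporary directory with Python's gzip module and loads it back. The file name sample.csv.gz and its contents are assumptions made for the example.

```python
import gzip
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the .gz extension triggers decompression.
    gz_path = os.path.join(tmp_dir, "sample.csv.gz")

    # Write a tiny gzipped CSV in text mode.
    with gzip.open(gz_path, "wt") as gz_file:
        gz_file.write("12.0,12.0,14.0\n13.0,11.0,13.5\n")

    # Load it the same way as an uncompressed file.
    array_from_gz = np.loadtxt(gz_path, delimiter=",")

print("NumPy array: \n", array_from_gz)
print("Shape: ", array_from_gz.shape)
```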

3. Create NumPy array from TSV

TSV (Tab Separated Values) files store plain-text data in tabular form. We create a NumPy array from a TSV file by passing "\t" as the delimiter argument to numpy.loadtxt().

#%%

# Create array from tsv files

array_rain_fall_tab = np.loadtxt(
    fname="rain-fall-row-col-names.tsv",
    delimiter="\t",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_tab)
print("Shape: ", array_rain_fall_tab.shape)
print("Data Type: ", array_rain_fall_tab.dtype.name)

OUTPUT:

NumPy array: 
 [[12.  12.  14.  16.  19.  12.  11.  14.  17.  19.  11.  11.5]
 [13.  11.  13.5 16.7 15.  11.  12.  11.  19.  18.  13.  12.5]]
Shape:  (2, 12)
Data Type:  float64
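Because the default delimiter=None splits on whitespace, purely numeric tab-separated data can usually be read without naming the delimiter at all. The sketch below compares both calls on a hypothetical in-memory sample. Passing "\t" explicitly remains the clearer choice for real TSV files.

```python
import io

import numpy as np

# Hypothetical tab-separated sample without header or label columns.
tsv_text = "12.0\t12.0\t14.0\n13.0\t11.0\t13.5\n"

# Explicit tab delimiter.
array_tab = np.loadtxt(io.StringIO(tsv_text), delimiter="\t")

# Default delimiter: any whitespace, which includes tabs.
array_whitespace = np.loadtxt(io.StringIO(tsv_text))

print("Same result: ", np.array_equal(array_tab, array_whitespace))
```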

4. Conclusion

In this tutorial we learned key techniques for creating a NumPy array from data stored in plain text files such as CSV and TSV. These techniques are handy both for data exploration and for program development.

Please download source code related to this tutorial here. You can run the Jupyter notebook for this tutorial here.
