Auto generate NumPy array

1. Intro

We sometimes need to generate a NumPy array automatically, for testing or other purposes. NumPy provides methods to create arrays filled with the same numerical value, with values between two numbers, and with an identity matrix. In this tutorial, we will see code examples for most of the available methods. These methods are a great toolkit for quick scripting and prototyping.

There are other tutorials for creating NumPy array from an existing Python data structure and creating NumPy array from values kept in files.

Python 3.6.5 and NumPy 1.15 are used. Visual Studio Code 1.30.2 is used to run the IPython interactive code.

2. Creating an empty NumPy array

As we know, a NumPy array is stored as a contiguous block in memory. When we add or remove rows or columns in an existing array, the entire array is copied to a new block in memory. This is inefficient, since every resize allocates fresh space for the elements.

One optimisation technique we can use when adding rows is to pre-allocate an array of the size we anticipate it will end up being. We can quickly create an empty NumPy array of a specified size using the numpy.empty(size, dtype=int) method.

#%%
import numpy as np
print("Numpy Version is ", np.__version__)
#%%
# Create an empty array of size (2,3)
size = (2, 3)
print("Empty array of size (2,3)\n", np.empty(size, dtype=int))

OUTPUT:

Numpy Version is  1.15.4
Empty array of size (2,3)
 [[0 0 0]
 [0 0 0]]

3. Creating a NumPy array of the same size of an existing array

We can use the numpy.empty_like(an_existing_array) method to create an empty array of the same size as an existing array. It is a handy tool for quickly creating another array.

# Create an empty array of same size as an existing array

# existing array
an_existing_array = np.array([[1, 2], [3, 4]])

print("Empty array of same size as an_existing_array\n", np.empty_like(an_existing_array))

OUTPUT

Empty array of same size as an_existing_array
 [[-1152921504606846976 -1152921504606846976]
 [                   8                    0]]

4. Creating a NumPy array with specified diagonal value

We can use the numpy.eye(number_of_rows, number_of_cols, index_of_diagonal) method to generate an array of a specified size with ones on a diagonal and zeros elsewhere.

When index_of_diagonal is 0, the ones are placed on the main diagonal. A positive value selects an upper diagonal, whereas a negative value selects a lower diagonal.

#%%

# Create an array with 4 rows and 3 cols with 1 on diagonal and 0 on other places

number_of_rows = 4
number_of_cols = 3

# 0 for main diagonal, positive value as upper and negative value as lower diagonal
index_of_diagonal = 0
print(
    "4 by 3 array with 1 on diagonal \n",
    np.eye(number_of_rows, number_of_cols, index_of_diagonal),
)

# Lower diagonal example
index_of_diagonal = -1
print(
    "4 by 3 array with 1 on lower diagonal \n",
    np.eye(number_of_rows, number_of_cols, index_of_diagonal),
)

OUTPUT:

4 by 3 array with 1 on diagonal 
 [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 0.]]
4 by 3 array with 1 on lower diagonal 
 [[0. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

5. Creating identity matrix

An identity matrix of size n is an n x n square matrix with ones on the main diagonal. We use numpy.identity(number_of_rows_and_cols) to create an identity matrix.

# Identity Matrix of 3 x 3 size

number_of_rows_and_cols = 3

print("3 by 3 identity array \n", np.identity(number_of_rows_and_cols))

OUTPUT:

3 by 3 identity array 
 [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

6. Creating an array with specific values

We can use the numpy.full(shape, fill_value) method to create an array of a specified size filled with a specified value.

#%%

# Create 3 x 4 size array full of value 7
shape = (3, 4)
fill_value = 7

print("3 x 4 array full of value 7\n", np.full(shape, fill_value))

OUTPUT:

3 x 4 array full of value 7
 [[7 7 7 7]
 [7 7 7 7]
 [7 7 7 7]]

We can use numpy.ones(shape) to quickly create an array full of ones, as an alternative to the numpy.full() method.

# Shortcut for an array with full of 1

print("3 x 4 array full of value 1\n", np.ones(shape))

OUTPUT:

3 x 4 array full of value 1
 [[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]

We can also use numpy.zeros(shape) to quickly create an array full of zeros.

# Shortcut for an array with full of 0

print("3 x 4 array full of value 0\n", np.zeros(shape))

OUTPUT:

3 x 4 array full of value 0
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

7. Creating array with values between two numbers

Use the numpy.arange(start, stop, step) method to generate a one-dimensional array with evenly spaced values within a range.

#%%

# Generate 1d array with values between two numbers with specified step
start = 10
stop = 20
step = 0.75

print("Array with values between two number\n", np.arange(start, stop, step))

OUTPUT:

Array with values between two number
 [10.   10.75 11.5  12.25 13.   13.75 14.5  15.25 16.   16.75 17.5  18.25
 19.   19.75]

Use numpy.linspace(start, stop, number_of_samples_to_generate, endpoint=False) to generate a specified number of values within a range. Use endpoint=True|False to include or exclude the stop value as the last element.

number_of_samples_to_generate = 5
print(
    "Array with specified no of values between two n\nos\n",
    np.linspace(start, stop, number_of_samples_to_generate, endpoint=False)
)

OUTPUT:

Array with specified no of values between two nos
 [10. 12. 14. 16. 18.]

Use numpy.logspace(start, stop, number_of_samples_to_generate, endpoint=False) to generate a specified number of values spaced evenly on a log scale within a range. Note that start and stop are treated as exponents of the base (10 by default), so the example below produces values from 10^10 to 10^20.

print(
    "Array with specified no of values spaced evenly on log space between two nos.\n",
    np.logspace(start, stop, number_of_samples_to_generate, endpoint=True)
)

OUTPUT:

Array with specified no of values spaced evenly on log space between two nos.
 [1.00000000e+10 3.16227766e+12 1.00000000e+15 3.16227766e+17
 1.00000000e+20]

Use numpy.geomspace(start, stop, number_of_samples_to_generate, endpoint=False) to generate a specified number of values spaced evenly on a log scale (a geometric progression) within a range. Unlike logspace, start and stop are the actual endpoint values.

print(
    "Array with specified no of values spaced evenly on log space (geomatric progression) between two nos.\n",
    np.geomspace(start, stop, number_of_samples_to_generate, endpoint=True)
)

OUTPUT:

Array with specified no of values spaced evenly on log space (geometric progression) between two nos.
 [10.         11.89207115 14.14213562 16.81792831 20.        ]

8. Conclusion

In this tutorial, we learned several techniques to auto-generate NumPy arrays of various values and shapes.

Please download source code related to this tutorial here. You can run the Jupyter notebook for this tutorial here.

Save NumPy array to file

1. Intro

We can learn about creating a NumPy array from plain text files like CSV and TSV in another tutorial. In this tutorial, we will see methods which help us save a NumPy array on the file system. We can later use those files to recreate the array.

A few techniques are critical for a data analyst, like saving an array in .npy or .npz format. Creating a NumPy array from a .npy file is much faster than from text files like CSV. Hence it is advisable to save NumPy arrays in this format if we want to refer to them in future.
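
As a rough illustration of that speed difference, here is a small timing sketch (the file names and array size are made up; exact numbers will vary by machine):

#%%
import time
import numpy as np

data = np.random.rand(1000, 1000)
np.save("speed-test.npy", data)
np.savetxt("speed-test.csv", data, delimiter=",")

t0 = time.perf_counter()
np.load("speed-test.npy")
print("npy load took", time.perf_counter() - t0, "seconds")

t0 = time.perf_counter()
np.loadtxt("speed-test.csv", delimiter=",")
print("csv load took", time.perf_counter() - t0, "seconds")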

Python 3.6.5 and NumPy 1.15 are used. Visual Studio Code 1.30.2 is used to run the IPython interactive code.

2. Save NumPy array as plain text file like CSV

We can save a NumPy array as a plain text file like CSV or TSV. We tend to use this method when we want to share the results of an analysis, since most analyses pass through multiple steps and key stakeholders can easily inspect the end result in a CSV file.

We can also provide custom delimiters.

We use the numpy.savetxt() method to save a NumPy array as a CSV or TSV file.

numpy.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ', encoding=None)

#%%
# Saving NumPy array as a csv file
import numpy as np

array_rain_fall = np.loadtxt(fname="rain-fall.csv", delimiter=",")
np.savetxt(fname="saved-rain-fall-row-col-names.csv", delimiter=",", X=array_rain_fall)

# Check generated csv file after loading it

array_rain_fall_csv_saved = np.loadtxt(
    fname="saved-rain-fall-row-col-names.csv", delimiter=","
)

print("NumPy array: \n", array_rain_fall_csv_saved)
print("Shape: ", array_rain_fall_csv_saved.shape)
print("Data Type: ", array_rain_fall_csv_saved.dtype.name)

OUTPUT:

NumPy array: 
 [[12.  12.  14.  16.  19.  12.  11.  14.  17.  19.  11.  11.5]
 [13.  11.  13.5 16.7 15.  11.  12.  11.  19.  18.  13.  12.5]]
Shape:  (2, 12)
Data Type:  float64
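
numpy.savetxt() also accepts a custom delimiter, number format and header; a short sketch reusing the array loaded above (the output file name is hypothetical):

#%%
np.savetxt(
    fname="saved-rain-fall.tsv",   # hypothetical output file
    X=array_rain_fall,
    delimiter="\t",                # tab as the custom delimiter
    fmt="%.1f",                    # one decimal place per value
    header="Delhi rainfall in mm, 2017 and 2018",
    comments="# ",                 # the header line is prefixed with '# '
)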

3. Save and read NumPy Binary file

We can save a NumPy array in a binary file format using the numpy_array.tofile() method. It is not recommended for archival or transfer across machines, as it loses the precision and endianness information (and, as the output below shows, the shape). It is better to use the .npy or .npz format for archival and retrieval purposes.

We use the numpy.fromfile() method to create a NumPy array from a binary file.

#%%

# Saving array as binary file and reading it

array_rain_fall.tofile("saved-rain-fall-binary")

array_rain_fall_binary = np.fromfile("saved-rain-fall-binary")

print("NumPy array: \n", array_rain_fall_binary)
print("Shape: ", array_rain_fall_binary.shape)
print("Data Type: ", array_rain_fall_binary.dtype.name)

OUTPUT:

NumPy array: 
 [12.  12.  14.  16.  19.  12.  11.  14.  17.  19.  11.  11.5 13.  11.
 13.5 16.7 15.  11.  12.  11.  19.  18.  13.  12.5]
Shape:  (24,)
Data Type:  float64
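
Because fromfile() returns a flat array, we have to restore the data type and shape ourselves; a minimal sketch using the file saved above:

#%%
array_rain_fall_restored = np.fromfile("saved-rain-fall-binary", dtype=np.float64)
array_rain_fall_restored = array_rain_fall_restored.reshape(2, 12)  # restore the original 2 x 12 shape
print("Restored shape: ", array_rain_fall_restored.shape)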

4. Save and read npy file

We recommend using .npy and .npz files to save NumPy arrays on disk for easy persistence and fast retrieval. Creating an array from a .npy file is faster than from CSV or other plain text files.

We use the numpy.save() method to save a file in .npy format.

numpy.save(file, arr, allow_pickle=True, fix_imports=True)

We create a NumPy array from a .npy file using the numpy.load() method.

numpy.load(file, mmap_mode=None, allow_pickle=True, fix_imports=True, encoding='ASCII')

#%%

# Saving array as .npy and reading it

np.save("saved-rain-fall-binary.npy", array_rain_fall)

array_rain_fall_npy = np.load("saved-rain-fall-binary.npy")

print("NumPy array: \n", array_rain_fall_npy)
print("Shape: ", array_rain_fall_npy.shape)
print("Data Type: ", array_rain_fall_npy.dtype.name)

OUTPUT:

NumPy array: 
 [[12.  12.  14.  16.  19.  12.  11.  14.  17.  19.  11.  11.5]
 [13.  11.  13.5 16.7 15.  11.  12.  11.  19.  18.  13.  12.5]]
Shape:  (2, 12)
Data Type:  float64

5. Save multiple arrays in one npz file

NumPy provides numpy.savez() to save multiple arrays in one file. We can load the .npz file with numpy.load() method.

numpy.savez(file, *args, **kwds)

Combining several NumPy arrays into one .npz file results in faster loading compared to loading individual .npy files.

#%%

# Saving multiple arrays in npz format. Loading and reading the array.

np.savez("saved-rain-fall-binary.npz", array_rain_fall, np.array([1, 2, 3, 4, 5]))

array_rain_fall_npz = np.load("saved-rain-fall-binary.npz")

print("NumPy array 1: \n", array_rain_fall_npz["arr_0"])
print("Shape of Array 1: ", array_rain_fall_npz["arr_0"].shape)
print("Data Type of Array 1: ", array_rain_fall_npz["arr_0"].dtype.name)

print("NumPy array 2: \n", array_rain_fall_npz["arr_1"])
print("Shape of Array 2: ", array_rain_fall_npz["arr_1"].shape)
print("Data Type of Array 2: ", array_rain_fall_npz["arr_1"].dtype.name)

OUTPUT:

NumPy array 1: 
 [[12.  12.  14.  16.  19.  12.  11.  14.  17.  19.  11.  11.5]
 [13.  11.  13.5 16.7 15.  11.  12.  11.  19.  18.  13.  12.5]]
Shape of Array 1:  (2, 12)
Data Type of Array 1:  float64
NumPy array 2: 
 [1 2 3 4 5]
Shape of Array 2:  (5,)
Data Type of Array 2:  int64

We use the numpy.savez_compressed() method to save a compressed .npz file.
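
A minimal sketch of the same workflow with numpy.savez_compressed() (the output file name and the keyword name rainfall are illustrative):

#%%
# Arrays can be given explicit keyword names instead of arr_0, arr_1, ...
np.savez_compressed("saved-rain-fall-compressed.npz", rainfall=array_rain_fall)

compressed = np.load("saved-rain-fall-compressed.npz")
print("NumPy array: \n", compressed["rainfall"])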

6. Conclusion

This tutorial covered useful methods which you can use to optimize your NumPy code further: save single or multiple arrays on disk and load them back quickly to improve efficiency and performance.

Please download source code related to this tutorial here. You can run the Jupyter notebook for this tutorial here.

Create NumPy array from Text file

1. Intro

NumPy has helpful methods to create an array from text files like CSV and TSV. In real life, our data often lives in the file system, so these methods decrease development and analysis time dramatically.

numpy.loadtxt(fname, dtype=float, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes', max_rows=None)

The NumPy loadtxt() method is an efficient way to load data from text files where each row has the same number of values.

Python 3.6.5 and NumPy 1.15 are used. Visual Studio Code 1.30.2 is used to run the IPython interactive code.

2. NumPy array from CSV file

We have a CSV file with Delhi rainfall data in millimeters for every month of 2017 and 2018.

CSV file

12.0, 12.0, 14.0, 16.0, 19.0, 12.0, 11.0, 14.0, 17.0, 19.0, 11.0, 11.5
13.0, 11.0, 13.5, 16.7, 15.0, 11.0, 12.0, 11.0, 19.0, 18.0, 13.0, 12.5

We will create a NumPy array from a CSV file using the numpy.loadtxt() method. This method takes a delimiter character, which makes it flexible enough to handle many file formats.

#%%
# Create an array from rain-fall.csv, keeping rainfall data in mm
import numpy as np

array_rain_fall = np.loadtxt(fname="rain-fall.csv", delimiter=",")

print("NumPy array: \n", array_rain_fall)
print("Shape: ", array_rain_fall.shape)
print("Data Type: ", array_rain_fall.dtype.name)

OUTPUT

NumPy array: 
 [[12.  12.  14.  16.  19.  12.  11.  14.  17.  19.  11.  11.5]
 [13.  11.  13.5 16.7 15.  11.  12.  11.  19.  18.  13.  12.5]]
Shape:  (2, 12)
Data Type:  float64

2.1 Error when different column counts in rows

While creating a NumPy array using the numpy.loadtxt() method, make sure all CSV rows have the same column count; otherwise it will result in an error.

Below, we try the numpy.loadtxt() method on the rain-fall-wrong.csv file, where the rows have different column counts.

#%%
# Check error when different column counts in rows

array_rain_fall_wrong = np.loadtxt(
    fname="rain-fall-wrong.csv", delimiter=","
)

OUTPUT:

ValueError: Wrong number of columns at line 2

2.2 Skipping rows and columns in CSV

We can skip rows and columns while creating a NumPy array from CSV. It is useful when the CSV contains row and column names.

We have to pass the skiprows and usecols arguments to the loadtxt() method.

rain-fall-row-col-names.csv file:

Year, Jan, Feb, Mar, Apr, May, Jun, July, Aug, Sep, Oct, Nov, Dec
2017, 12.0, 12.0, 14.0, 16.0, 19.0, 12.0, 11.0, 14.0, 17.0, 19.0, 11.0, 11.5
2018, 13.0, 11.0, 13.5, 16.7, 15.0, 11.0, 12.0, 11.0, 19.0, 18.0, 13.0, 12.5
#%%
# Skip first row and first column

array_rain_fall_named = np.loadtxt(
    fname="rain-fall-row-col-names.csv",
    delimiter=",",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_named)
print("Shape: ", array_rain_fall_named.shape)
print("Data Type: ", array_rain_fall_named.dtype.name)

OUTPUT:

NumPy array: 
 [[12.  12.  14.  16.  19.  12.  11.  14.  17.  19.  11.  11.5]
 [13.  11.  13.5 16.7 15.  11.  12.  11.  19.  18.  13.  12.5]]
Shape:  (2, 12)
Data Type:  float64

2.3 Create NumPy array with GZipped file

Gzip is helpful in reducing the size of files, especially text. For a file with a .gz extension, numpy.loadtxt() automatically decompresses it first, before processing as usual.

We can use it for text files with any delimiter.

#%%
# Create array from gzipped csv

array_rain_fall_zip = np.loadtxt(
    fname="rain-fall-row-col-names.csv.gz",
    delimiter=",",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_zip)
print("Shape: ", array_rain_fall_zip.shape)
print("Data Type: ", array_rain_fall_zip.dtype.name)

OUTPUT:

NumPy array: 
 [[12.  12.  14.  16.  19.  12.  11.  14.  17.  19.  11.  11.5]
 [13.  11.  13.5 16.7 15.  11.  12.  11.  19.  18.  13.  12.5]]
Shape:  (2, 12)
Data Type:  float64

3. Create NumPy array from TSV

TSV (Tab Separated Values) files are used to store plain text in tabular form. We create a NumPy array from a TSV file by passing \t as the delimiter argument to the numpy.loadtxt() method.

#%%

# Create array from tsv files

array_rain_fall_tab = np.loadtxt(
    fname="rain-fall-row-col-names.tsv",
    delimiter="\t",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_zip)
print("Shape: ", array_rain_fall_zip.shape)
print("Data Type: ", array_rain_fall_zip.dtype.name)

OUTPUT:

NumPy array: 
 [[12.  12.  14.  16.  19.  12.  11.  14.  17.  19.  11.  11.5]
 [13.  11.  13.5 16.7 15.  11.  12.  11.  19.  18.  13.  12.5]]
Shape:  (2, 12)
Data Type:  float64

4. Conclusion

In this tutorial we learned key techniques to create a NumPy array from data stored in plain text files like CSV and TSV. These methods are very handy for data exploration as well as program development.

Please download source code related to this tutorial here. You can run the Jupyter notebook for this tutorial here.

Create NumPy array from Python list and tuples

1. Intro

In this tutorial, we will learn various ways to create a NumPy array from Python structures like lists and tuples. It is helpful in use cases where we want to leverage the power of NumPy operations on existing data structures.

Python 3.6.5 and NumPy 1.15 are used. Visual Studio Code 1.30.2 is used to run the IPython interactive code.

2. One dimensional NumPy array from Python list

We will use the numpy.array(object) method to create a 1-dimensional NumPy array from a Python list. The list contains integer values.

#%%
# Do some import
import numpy as np
print("Numpy Version is ", np.__version__)

#%%
# Creating 1 dimensional numpy array with Python list (int type)
one_d_list = [1, 2, 3, 4, 5]
array_one_dim_list = np.array(one_d_list)
print("NumPy array: ", array_one_dim_list)
print("Shape: ", array_one_dim_list.shape)
print("Data Type: ", array_one_dim_list.dtype.name)

OUTPUT

Numpy Version is  1.15.4

NumPy array:  [1 2 3 4 5]
Shape:  (5,)
Data Type:  int64

3. Two dimensional NumPy array from Python list

We will use numpy.array(object) method to create 2-dimensional NumPy array from the Python list. The list contains float values.

#%%

# Creating 2 dimensional numpy array with Python list (float type)
two_d_list = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
array_two_d_py_list = np.array(two_d_list)
print("NumPy array: \n", array_two_d_py_list)
print("Shape: ", array_two_d_py_list.shape)
print("Data Type: ", array_two_d_py_list.dtype.name)

OUTPUT:

NumPy array: 
 [[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]
Shape:  (3, 3)
Data Type:  float64

4. Three Dimensional NumPy array using Python list

We will use the numpy.array(object) method to create a 3-dimensional NumPy array from a Python list. The list contains String values.

Non-numeric values in a NumPy array defeat much of its purpose. However, it is possible to create a String data type NumPy array.

#%%

# Creating 3 dimensional numpy array with Python 3d list (String)
three_d_list = [
    [["aa", "bb", "cc"], ["dd", "ee", "ff"], ["gg", "hh", "kk"]],
    [["ll", "mm", "nn"], ["oo", "pp", "qq"], ["rr", "ss", "tt"]],
]
three_d_array = np.array(three_d_list)
print("NumPy array: \n", three_d_array)
print("Shape: ", three_d_array.shape)
print("Data Type: ", three_d_array.dtype.name)

OUTPUT:

NumPy array: 
 [[['aa' 'bb' 'cc']
  ['dd' 'ee' 'ff']
  ['gg' 'hh' 'kk']]

 [['ll' 'mm' 'nn']
  ['oo' 'pp' 'qq']
  ['rr' 'ss' 'tt']]]
Shape:  (2, 3, 3)
Data Type:  str64

5. NumPy array using Python list with mix data type elements

We will use numpy.array(object) method to create a 1-dimensional NumPy array from the Python list. The list contains integer and float values.

#%%
# Creating NumPy array with Python list with mix data type elements
mix_data_type_list = [1.0, 2, 3.5, 4, 5.0]
mix_data_type_array = np.array(mix_data_type_list)
print("NumPy array: \n", mix_data_type_array)
print("Shape: ", mix_data_type_array.shape)
print("Data Type: ", mix_data_type_array.dtype.name)

OUTPUT:

NumPy array: 
 [1.  2.  3.5 4.  5. ]
Shape:  (5,)
Data Type:  float64

NumPy supports only homogeneous elements in an array. While creating the array, when NumPy finds heterogeneous elements (float and int) in the Python list, it automatically converts the int elements to float.

NumPy casts elements to the larger data type while creating an array. This is called upcasting.
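
A few quick examples of this upcasting behaviour (a small sketch; the exact dtype names can vary by platform):

#%%
print(np.array([True, 2]).dtype)         # bool and int upcast to an int dtype
print(np.array([1, 2.5]).dtype)          # int and float upcast to float64
print(np.array([1, 2.5, 3 + 0j]).dtype)  # everything upcast to complex128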

6. Specify data type while creating NumPy array

We will use the same Python list from the previous code block to create a new NumPy array. However, we will force the data type of the NumPy array.

#%% 
# Creating NumPy array with Python list and specifying data type
fix_data_type_array = np.array(mix_data_type_list, np.int)
print("NumPy array: \n", fix_data_type_array)
print("Shape: ", fix_data_type_array.shape)
print("Data Type: ", fix_data_type_array.dtype.name)

OUTPUT:

NumPy array: 
 [1 2 3 4 5]
Shape:  (5,)
Data Type:  int64

We forced NumPy to make the array elements of int data type by passing a second argument to np.array(object, datatype), which converted the float values into ints.

In the absence of this parameter, NumPy would have upcast the int elements to float.

7. Create a NumPy array with Python list and tuple

#%%

# Creating NumPy array with mix of List and Tuples upgraded

a_list = [1, 2.5, 3]
a_tuple = (1.5 , 2.3, 3)

two_d_list_tuple_array = np.array([a_list, a_tuple])
print("NumPy array: \n", two_d_list_tuple_array)
print("Shape: ", two_d_list_tuple_array.shape)
print("Data Type: ", two_d_list_tuple_array.dtype.name)

OUTPUT:

NumPy array: 
 [[1.  2.5 3. ]
 [1.5 2.3 3. ]]
Shape:  (2, 3)
Data Type:  float64

8. NumPy array with Jagged Python List

We will create a NumPy array from a jagged 2-dimensional Python list and observe the resulting data type.

#%%

# Creating NumPy array with jagged 2 d Python list

jagged_two_d_list = [[1, 2, 3], [4, 5, 6], [7, 8]]

jagged_two_d_array = np.array(jagged_two_d_list)
print("NumPy array: \n", jagged_two_d_array)
print("Shape: ", jagged_two_d_array.shape)
print("Data Type: ", jagged_two_d_array.dtype.name)

OUTPUT:

NumPy array: 
 [list([1, 2, 3]) list([4, 5, 6]) list([7, 8])]
Shape:  (3,)
Data Type:  object

When we pass a jagged list to the numpy.array() method, NumPy creates an array of Object elements (each element holds a Python list).

9. NumPy array with minimum dimension

We can enforce a minimum number of dimensions for a NumPy array created using numpy.array(object=list, ndmin=3), even if the Python list passed in does not have that many dimensions.

#%%
# Create NumPy array with Python List and enforce minimum dimension

# We are using an already create one dimensional array one_d_list
array_enforced_three_dim_list = np.array(object=one_d_list, ndmin=3)
print("NumPy array: ", array_enforced_three_dim_list)
print("Shape: ", array_enforced_three_dim_list.shape)
print("Data Type: ", array_enforced_three_dim_list.dtype.name)

OUTPUT:

NumPy array:  [[[1 2 3 4 5]]]
Shape:  (1, 1, 5)
Data Type:  int64

NumPy prepends ones to the shape as needed to meet the minimum dimension requirement.

10. asarray method

The NumPy asarray method does not copy an object if it is not required, while the array method copies the object by default. We can change the default behaviour of array by passing False to its copy argument: array(object=list, copy=False).

#%%
# Use asarray to create NumPy array

one_dim_list = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
one_dim_array_use_asarray = np.asarray(one_dim_list)
print("NumPy array: \n", one_dim_array_use_asarray)
print("Shape: ", one_dim_array_use_asarray.shape)
print("Data Type: ", one_dim_array_use_asarray.dtype.name)

OUTPUT:

NumPy array: 
 [[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
Shape:  (2, 5)
Data Type:  int64

When a Python list is used, numpy.asarray still copies the data, because it has to convert the list into a NumPy array.

But when we use an existing NumPy array to create a new array, asarray does not make a copy.

#%%
# array method makes a copy by default, hence the change is applied to the copy
np.array(one_dim_array_use_asarray)[1]=1
print("NumPy array: \n", one_dim_array_use_asarray)

# asarray method does not make a copy, hence the change is applied to one_dim_array_use_asarray
np.asarray(one_dim_array_use_asarray)[1]=1
print("NumPy array: \n", one_dim_array_use_asarray)

OUTPUT:

NumPy array: 
 [[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
NumPy array: 
 [[1 2 3 4 5]
 [1 1 1 1 1]]

We saw how asarray does not copy when it is not needed, in contrast with the array method's behaviour.

11. Conclusion

We learned various ways to create a NumPy array from Python lists and tuples.

Download source code related to this tutorial here

  1. Python – create-array-with-python-structures.py
  2. Jupyter – create-array-with-python-structures.ipynb

Run the Jupyter notebook for this tutorial.

12. References

  1. Various NumPy data types
  2. NumPy array(object) method
  3. NumPy array vs asarray

What is Data Structure?

Every algorithm which solves a useful problem relies on a data structure. Data structures are mechanisms to keep data (primitive data types) arranged inside the computer's memory in some structure. They help algorithms reach an efficiency which would otherwise not be possible. It is not unusual to map a computational problem onto a data structure at the very first stage of problem solving. They are the building blocks of the modern computer programming model.

Students in the classroom - a real life data structure example

Let's consider that we want to arrange the names of students in alphabetical order using a program. When we formulate a solution, we need a way to keep the list of students before we apply the sorting algorithm. Obviously, we will use an array (list) to hold the names and let our algorithm iterate over it and sort it. Arrays and other important data structures are so indispensable and so often used that they are offered as core features of programming languages.
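
For instance, a minimal Python sketch of the idea (the student names are made up):

students = ["Meera", "Arjun", "Zoya", "Bhavesh"]   # a list holds the data
students.sort()                                    # the sorting algorithm works over the list
print(students)                                    # ['Arjun', 'Bhavesh', 'Meera', 'Zoya']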

Data Structure in real life

Dictionary book

In a dictionary, words are sorted in alphabetical order, which enables us to search for a word quickly. If the words (data) were not kept in sorted order, it would be practically impossible to find any word. A dictionary is an example of a data structure we use in real life. There are many other examples of real-life data structures; in fact, they have long been the inspiration for computational data structures.

Abstract Data Types

An abstract data type (ADT) is a blueprint for a data structure implementation. It specifies the operations to carry out, as well as the computational complexity of those operations. Data structures are concrete implementations of abstract data types. An abstract data type is a mathematical model which defines the operations, values and behaviour of the data type from the user's perspective.

Integers

Integers are an ADT with

values: .. -2, -1, 0, 1, 2, ..

Operations:  addition, subtraction, multiplication, division

as per general mathematics.

However, computers are free to represent integers internally however they want.

Important Abstract Data Types

Container

A container is a collection of other objects.

Container Abstract Data Type

Three important characteristics:

  1. Access – How do we access the objects? By index in the case of an Array, LIFO (Last In First Out) in the case of a Stack, and FIFO (First In First Out) in the case of a Queue.
  2. Storage – How do we store the objects in the container?
  3. Traversal – How do we traverse the objects of the container?

The following operations are desired on the Container class:

  1. Create
  2. Insert
  3. Delete
  4. Delete all
  5. Access the object
  6. Access the objects count

All of these operations are self-explanatory; the point is to convey the essence of abstract data types. In real life, the Container has various implementations.
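
A minimal Python sketch of such a container, backed here by a plain list purely for illustration:

class Container:
    def __init__(self):              # Create
        self._items = []

    def insert(self, obj):           # Insert
        self._items.append(obj)

    def delete(self, obj):           # Delete
        self._items.remove(obj)

    def delete_all(self):            # Delete all
        self._items.clear()

    def get(self, index):            # Access the object
        return self._items[index]

    def count(self):                 # Access the objects count
        return len(self._items)

box = Container()
box.insert("pen")
box.insert("book")
print(box.get(0), box.count())       # pen 2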

List

A list holds values, and we can count the number of values. The same value may appear multiple times. The list is a basic example of a container.

List Abstract Data Type

Set

A set holds distinct values, not in any particular order.

Set Abstract Data Type

Map

A map is built from a collection of (key, value) pairs, where each key appears only once in the collection. It is also known as an associative array, symbol table, or dictionary.

Map Abstract Data Type
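
Python's built-in types map closely onto these three ADTs; a small illustrative sketch:

marks = [70, 85, 70, 90]                   # List: ordered, duplicates allowed
unique_marks = set(marks)                  # Set: distinct values, no particular order
roll_to_name = {101: "Asha", 102: "Ravi"}  # Map: each key appears only once
print(marks, unique_marks, roll_to_name[101])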

Graph

The graph abstract data type is made up of nodes (also called vertices or points) and edges (also called arcs or lines). Typical operations include checking whether two nodes are adjacent, finding the neighbours of a node, adding a node (vertex) and removing a node (vertex).

Graph Abstract Data Type

Stack

The stack abstract data type is a collection of elements with two important operations:

  1. Push – Adds an element to the collection.
  2. Pop – Removes an element from the collection. Elements are removed in Last In First Out (LIFO) order.

Stack abstract data type (LIFO)
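
A minimal Python sketch of a stack using a plain list:

stack = []
stack.append("a")    # push
stack.append("b")    # push
print(stack.pop())   # pop -> 'b' (last in, first out)
print(stack.pop())   # pop -> 'a'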

Queue

The queue abstract data type is a collection of elements where elements are added at the rear (called enqueue) and removed from the front (called dequeue). It is also known as a First In First Out (FIFO) data structure.

Queue (FIFO) abstract data type
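
A minimal Python sketch of a queue using collections.deque:

from collections import deque

queue = deque()
queue.append("first")     # enqueue at the rear
queue.append("second")    # enqueue at the rear
print(queue.popleft())    # dequeue from the front -> 'first'
print(queue.popleft())    # dequeue from the front -> 'second'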

This is a brief introduction to data structures, to start our journey towards understanding them in depth. Please see the reference section for more details about the terminology used in this article.

Reference

  1. Data structure wikipedia article. Ref
  2. Abstract Data Type wikipedia article. Ref
  3. A very good introduction to data structure. Youtube Video Ref

15 reasons to try AWS DynamoDB as your NOSQL database

AWS DynamoDB Database as managed service

AWS DynamoDB is a managed NoSQL document database service. It’s a proprietary NoSQL database created by AWS. Amazon uses it on their eCommerce website. Hence its performance and scalability are proven.

I have used it in a high-volume data project which needs more than 7000 writes per second and generates around 50 GB of data daily. Though it takes effort to design an application around it, it scales really well.

In this article, you will see a few good reasons to evaluate AWS DynamoDB when you are planning to use MongoDB, Cassandra or similar databases.

AWS DynamoDB is a managed service

Keeping a database running is not a small job. If the size of the data is in terabytes and growing, you need a team of infrastructure engineers to carry out the following tasks:

  1. Architecture & design for a multi-region, multi-partition and redundant high-performance database.
  2. 24 X 7 monitoring of database nodes health.
  3. Database engine upgrade.
  4. OS upgrade.
  5. Regular disk and memory space planning, monitoring and implementation.
  6. Computational power planning, monitoring and implementation.
  7. Security audit & trail.
  8. Occasional database node maintenance and replacement.

If we are using MongoDB or Cassandra and want to run the database with terabytes of data, we have to make sure all the above-mentioned tasks are looked after by an infrastructure team.

AWS DynamoDB, however, keeps you free from all the above tasks, as it is a managed service. You just create tables and start pouring in data.

It helps reduce database infrastructure management cost to nearly zero, which is one of its biggest selling points.

Even Petabytes of data is fine

AWS DynamoDB doesn't have any limit on the size of tables, so even petabytes of data are handled with the same performance. All the data is kept on solid state drive (SSD) servers.

Easy read and write throughput management

AWS DynamoDB is a true cloud database. It provides the following options to manage read and write throughput elasticity:

  1. Auto Scale – With the Auto Scale feature you can define rules to increase and decrease the read and write capacity of a table; when a certain percentage of the provisioned throughput is reached, AWS automatically increases or decreases the number of partitions required to handle the new throughput. It helps reduce cost by keeping the number of partitions optimal as per demand.
  2. Use a cron job to trigger a change in the table's read and write throughput, using AWS CLI commands in a script (a scripted example follows below).
  3. Manually change throughput from the management console.

A change in the table throughput can result in the creation or deletion of partitions. AWS makes sure all of this happens without any downtime.
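
As an illustration of the scripted option above, here is a hedged boto3 (Python) sketch that changes a table's provisioned throughput; the table name and capacity numbers are made up:

import boto3

client = boto3.client("dynamodb", region_name="us-east-1")
client.update_table(
    TableName="events",  # hypothetical table
    ProvisionedThroughput={
        "ReadCapacityUnits": 50,
        "WriteCapacityUnits": 200,
    },
)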

Automatic data and traffic management

DynamoDB automatically manages the replication and partitioning of a table based on its data size. It continuously monitors the table size and spreads the table over a sufficient number of servers, replicated to multiple availability zones in a region when required. All of this happens without any downtime and without our involvement.

On-demand backup & recovery for the table

DynamoDB is designed not to lose your data, because it replicates it across multiple fault-tolerant zones.

Still, keeping periodic backups of a table can save us when an application corrupts data, and in some corporations there is a compliance requirement for it. DynamoDB provides a simple admin-console and API based backup and recovery mechanism. Backup and recovery are very fast and complete in seconds, regardless of the size of the table.

Point in time recovery

DynamoDB provides a point-in-time recovery feature to restore a table to any point within the last 5 weeks (35 days). It complements the on-demand backup and recovery feature.

Multi-region global tables

AWS DynamoDB automatically syncs data between multiple regions for global tables. You just need to specify the regions in which you want a table to be available. Without global tables, you would have to do this on your own by writing code that copies data across regions.

It is really helpful if an application needs multi-region replication for performance reasons.

Inbuilt in-memory caching service DAX (DynamoDB Accelerator)

Caching improves the performance dramatically and cuts the load on database engine for read queries.

DynamoDB Accelerator (DAX) is an optional caching layer which you can set up with a few clicks. DAX is a cache layer specially built to work with DynamoDB, and you may prefer it over ElastiCache or self-hosted Redis because of its performance alongside DynamoDB.

DynamoDB typically returns read queries in under 100 milliseconds; with DAX this improves further and queries return in under 10 milliseconds.

Encryption at rest

DynamoDB request and response traffic is HTTP based, just like many other NoSQL databases. Encryption at rest adds an extra layer of security for data, preventing unauthorised access to the underlying storage; it is sometimes required for compliance. It uses 256-bit AES encryption and encrypts table data as well as indexes. It works seamlessly with the AWS Key Management Service for encryption keys.

Document and key-value item storage

DynamoDB can store JSON documents or key-value items in a table.

Schema-less

Like other NoSQL document databases, DynamoDB is schema-less. The key attribute is the only mandatory attribute in an item.

Eventual and Immediate consistency

DynamoDB offers two read consistency modes.

Eventual consistency – The cheaper option; a query may or may not return the latest version of an item.

Immediate (strong) consistency – Used when your application needs query results that always reflect the latest items.
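
A small boto3 (Python) sketch writing an item and reading it back with a strongly consistent read; the table and attribute names are made up:

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("users")  # hypothetical table keyed on user_id

table.put_item(Item={"user_id": "u1", "name": "Asha", "score": 42})

# ConsistentRead=True asks for an immediately consistent read; omit it for an eventually consistent one
response = table.get_item(Key={"user_id": "u1"}, ConsistentRead=True)
print(response.get("Item"))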

Time to live items

This is one of the powerful features of DynamoDB which enables use cases that would otherwise require custom application code. You can have items deleted automatically by a background sweeper after a certain amount of time.

Streams

DynamoDB Streams is another powerful feature which enables the execution of an AWS Lambda function when an item is created, updated or deleted. Streams are similar to an AWS Kinesis stream and can be used for many use cases, e.g. building your own data pipeline for aggregated records like averages and sums, or sending an email when a new user record is inserted.

Local DynamoDB setup

For ease of development and integration testing, you can use the DynamoDB local distribution. It is a Java application and runs anywhere a Java Runtime Environment is installed.
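
Once the local distribution is running (it listens on port 8000 by default), you can point the SDK at it; a hedged boto3 (Python) sketch:

import boto3

local_db = boto3.resource(
    "dynamodb",
    endpoint_url="http://localhost:8000",  # local DynamoDB endpoint
    region_name="us-east-1",
    aws_access_key_id="dummy",             # local mode does not validate credentials
    aws_secret_access_key="dummy",
)
print(list(local_db.tables.all()))         # list the tables in the local instance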

One last thing which I have not highlighted but which is important: being part of the AWS cloud offering, DynamoDB can easily integrate with AWS Athena for big data computation needs. You can also integrate it with Apache Spark or other big data computation engines.

I suggest you try DynamoDB for your NoSQL needs and see if it fits. AWS provides a generous free tier to start with.

References

  1. Official AWS documentation for developers 

React – Hello World example

In this article we will learn how a React UI is wired into an HTML page (an index page) to display Hello World text in a paragraph. We already learned about the basic setup, JSX and ES6 concepts in an earlier tutorial. This tutorial is very basic, so don't expect much in terms of advanced features; we will learn progressively through this tutorial series.

Tools & Libraries used:

  1. React 16.x
  2. NodeJS 6.x
  3. NPM 5.x

Get the source code for this tutorial in either of the ways defined below:

  1. Download React tutorial series source code zip. Unzip it. You can find this tutorial source code in 01-hello-world folder.
  2.  git clone https://github.com/geekmj/react-examples.git

Note: This tutorial was tested on a macOS 10.13 machine. For developers using a Windows machine, the command syntax will change a bit.

Files & Folders Structure

Hello World project folder structure

In our project structure there is a separation between how we keep source code for development and how it is actually packaged for deployment on web servers by the build tool.

NPM dependency and build file

We are using NPM for client-side library dependency management, and the package.json source is given below:

{
  "name": "01-hello-world",
  "version": "0.1.0",
  "private": true,
  "dependencies": {
    "react": "^16.0.0",
    "react-dom": "^16.0.0",
    "react-scripts": "1.0.17"
  },
  "scripts": {
    "start": "react-scripts start",
    "build": "react-scripts build",
    "test": "react-scripts test --env=jsdom",
    "eject": "react-scripts eject"
  }
}

index HTML file

The index.html file is the initialisation point for the Single Page App; it has one div with id root within the body.

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <meta name="theme-color" content="#000000">
    <!--
      manifest.json provides metadata used when your web app is added to the
      homescreen on Android. See https://developers.google.com/web/fundamentals/engage-and-retain/web-app-manifest/
    -->
    <link rel="manifest" href="%PUBLIC_URL%/manifest.json">
    <link rel="shortcut icon" href="%PUBLIC_URL%/favicon.ico">
    <!--
      Notice the use of %PUBLIC_URL% in the tags above.
      It will be replaced with the URL of the `public` folder during the build.
      Only files inside the `public` folder can be referenced from the HTML.

      Unlike "/favicon.ico" or "favicon.ico", "%PUBLIC_URL%/favicon.ico" will
      work correctly both with client-side routing and a non-root public URL.
      Learn how to configure a non-root public URL by running `npm run build`.
    -->
    <title>React App</title>
  </head>
  <body>
    <noscript>
      You need to enable JavaScript to run this app.
    </noscript>
    <div id="root"></div>
    <!--
      This HTML file is a template.
      If you open it directly in the browser, you will see an empty page.

      You can add webfonts, meta tags, or analytics to this file.
      The build step will place the bundled scripts into the <body> tag.

      To begin the development, run `npm start` or `yarn start`.
      To create a production bundle, use `npm run build` or `yarn build`.
    -->
  </body>
</html>

index JS file

The index JS file is the starting point for the React application.

import React from 'react';
import ReactDOM from 'react-dom';

ReactDOM.render(<p>Hello world</p>, document.getElementById('root'));

We included React and ReactDOM.

We are using the ReactDOM.render method, which takes two arguments: first, a JSX expression for the UI, and second, the HTML DOM element where React will place the UI.

Running the application in development mode

In a terminal, go to the project folder 01-hello-world and run the following command:

$ npm start

Compiled successfully!

You can now view 01-hello-world in the browser.

  Local:            https://localhost:3000/
  On Your Network:  https://10.11.2.101:3000/

Note that the development build is not optimized.
To create a production build, use yarn build.

Note: IP in On Your Network will be your system IP.

You will see page like below when you access the URL:

Hello World Project Web Page

Build application for deployment on web server

In a terminal, go to the project folder 01-hello-world and run the following command:

$ npm run build

> 01-hello-world@0.1.0 build /Volumes/Drive2/projects/project_workspace/github/react-examples/01-hello-world
> react-scripts build

Creating an optimized production build...
Compiled successfully.

File sizes after gzip:

  39.08 KB  build/static/js/main.12cfb16d.js

The project was built assuming it is hosted at the server root.
To override this, specify the homepage in your package.json.
For example, add this to build it for GitHub Pages:

  "homepage" : "https://myname.github.io/myapp",

The build folder is ready to be deployed.
You may serve it with a static server:

  yarn global add serve
  serve -s build

The build folder contains a web-server deployable package and looks like below:

Node app production build folder structure

The build script creates a package optimised for production deployment. It joins many JS files into one and minifies them for better performance, so you can serve it from any web server.

And that’s it. Enjoy.

Download the source code | Follow the project on GitHub

References

  1. React Setup and getting started 
  2. NPM & Front End Packaging
  3. Webpack module bundle concepts
  4. NodeJS and NPM installation guide
  5. Babel for enabling ES6
  6. Hello World Example At React JS documentation

Note: While writing this article, I am working as Senior Technical Architect with Magic Software Inc.