Basics of pandas DataFrame

Learn about the pandas DataFrame structure, how to create one, and its core attributes.


Now that you know how to work with one-dimensional data using pandas Series, it is time to learn about pandas DataFrame to start working with two-dimensional data.


What is a pandas DataFrame?

A pandas DataFrame is a two-dimensional data structure that stores data in rows and columns, much like a spreadsheet or an SQL table. Here's an example of a DataFrame with 3 columns and 3 rows.

|   | First Name | Last Name | Age |
|---|---|---|---|
| 0 | Will | Smith | 52 |
| 1 | Michael | Jackson | 50 |
| 2 | Kobe | Bryant | 41 |

Each column of a pandas DataFrame is a pandas Series, so different columns can hold different data types (integer, float, boolean, etc.). However, all columns in a pandas DataFrame have the same length.
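
As a minimal sketch of this idea, the snippet below rebuilds the example table and checks that a single column is a pandas Series and that the columns carry different data types. The variable names `people` and `age` are chosen here for illustration only.

```python
import pandas as pd

# Rebuild the example table from above (the name `people` is illustrative)
people = pd.DataFrame(
    {
        "First Name": ["Will", "Michael", "Kobe"],
        "Last Name": ["Smith", "Jackson", "Bryant"],
        "Age": [52, 50, 41],
    }
)

# Selecting a single column returns a pandas Series
age = people["Age"]
print(type(age))

# Each column has its own data type
print(people.dtypes)
```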


How to create a pandas DataFrame?

A pandas DataFrame can be created manually using a Python List, a NumPy array or a Python dictionary. You can also create a pandas DataFrame by reading in a CSV file, Excel file, etc. or by querying an SQL table. We will be looking at a few methods for creating a pandas DataFrame in the sections below.

The general syntax for creating a pandas DataFrame is to call the DataFrame() constructor from pandas:

pandas.DataFrame(data)

Here, data can be any value that is array-like, iterable, dict, or a DataFrame itself. It contains the data to be stored in the DataFrame.

Creating a pandas DataFrame from a Python List

python
# Importing the pandas library as pd
import pandas as pd

# Initializing a Python list
lst = ["Python", "Java", "C", "Ruby"]

# Creating a pandas DataFrame from a Python List
df = pd.DataFrame(lst)

# Printing the DataFrame
print(df)

Although the output above looks like a pandas Series, it is actually a DataFrame with a single column. You can verify this with Python's built-in type() function or by looking at the shape of the DataFrame.

python
# Printing the type of the variable
print(type(df))

# Printing the shape of the variable
print(df.shape)
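
To make the difference in shape concrete, the following sketch builds both a Series and a DataFrame from the same list and compares their shapes. The variable names `series` and `frame` are illustrative.

```python
import pandas as pd

lst = ["Python", "Java", "C", "Ruby"]

# The same list, stored as a Series and as a DataFrame
series = pd.Series(lst)
frame = pd.DataFrame(lst)

# A Series has one dimension
print(series.shape)  # (4,)

# A DataFrame has two dimensions: 4 rows and 1 column
print(frame.shape)  # (4, 1)
```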

Moving on, because a pandas DataFrame is a two-dimensional data structure, it can also be created from a list of Python lists. Each inner list becomes one row of the DataFrame, and each resulting column is stored as a pandas Series.

python
# Initializing multiple lists
lst1 = ["Apple", "1kg", 300]
lst2 = ["Orange", "2kg", 150]
lst3 = ["Mango", "5kg", 800]

# Creating a pandas DataFrame
df = pd.DataFrame(data=[lst1, lst2, lst3])

# Printing the DataFrame
print(df)

Here, we did not specify any index or column labels when creating our DataFrame, so pandas assigned default integer labels starting from 0. If you want to create a pandas DataFrame with meaningful labels, you can pass the index and columns parameters during DataFrame initialization.

python
# Initializing multiple lists
lst1 = ["Apple", "1kg", 300]
lst2 = ["Orange", "2kg", 150]
lst3 = ["Mango", "5kg", 800]

# Creating a pandas DataFrame
df = pd.DataFrame(
    data=[lst1, lst2, lst3],
    index=["First", "Second", "Third"],
    columns=["Fruit", "Weight", "Price"],
)

# Printing the DataFrame
print(df)

Note: Creating a pandas DataFrame from a NumPy array works the same way as creating one from a Python list or a list of Python lists.
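
A minimal sketch of the NumPy case is shown here. The array contents, the labels, and the name `df_from_array` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A 2-D NumPy array: each row of the array becomes a row of the DataFrame
arr = np.array([[1, 2, 3], [4, 5, 6]])

# The index and columns parameters work just as they do with lists
df_from_array = pd.DataFrame(
    arr,
    index=["First", "Second"],
    columns=["A", "B", "C"],
)

print(df_from_array)
```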

Creating a pandas DataFrame from a Python Dictionary

A pandas DataFrame can be created from a Python dictionary where the keys are the column names and the values are lists holding the data for each column.

python
# Creating a Python Dictionary
dictionary = {
    "Fruit": ["Apple", "Orange", "Mango"],
    "Weight": ["1kg", "2kg", "5kg"],
    "Price": [300, 150, 800],
}

# Creating a Pandas DataFrame
df = pd.DataFrame(dictionary)

# Printing the DataFrame
print(df)

Creating a pandas DataFrame from external sources

A pandas DataFrame can be created from external files using pandas functions such as read_csv(), read_excel(), etc.

This capability of pandas to read in external sources as a pandas DataFrame is one of the many reasons why pandas is favored by many Python programmers to work with tabular data.

By the way, you can also save a pandas DataFrame in various file formats using methods such as to_csv(), to_excel(), etc.
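
The sketch below shows a round trip through the CSV format. To stay self-contained it writes to a string and reads from an in-memory buffer instead of a real file; in practice you would usually pass a file path. The names `df_out`, `csv_text`, and `df_in` are illustrative.

```python
import io

import pandas as pd

df_out = pd.DataFrame(
    {"Fruit": ["Apple", "Orange", "Mango"], "Price": [300, 150, 800]}
)

# When no path is given, to_csv() returns the CSV text as a string
csv_text = df_out.to_csv(index=False)
print(csv_text)

# read_csv() accepts a file path or a file-like object such as StringIO
df_in = pd.read_csv(io.StringIO(csv_text))
print(df_in)
```

Passing index=False leaves the row labels out of the CSV text, so they are not read back as an extra column.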

Attributes of a pandas DataFrame

The attributes of a pandas DataFrame describe its intrinsic properties, such as its labels, shape, and data types. The following table lists commonly accessed pandas DataFrame attributes along with their meaning.

DataFrame Attributes

| Attribute | Definition |
|---|---|
| axes | Return a list representing the axes of the DataFrame. |
| columns | The column labels of the DataFrame. |
| dtypes | Return the dtypes in the DataFrame. |
| index | The index (row labels) of the DataFrame. |
| shape | Return a tuple representing the dimensionality of the DataFrame. |
| size | Return an int representing the number of elements in this object. |
| values | Return a NumPy representation of the DataFrame. |
| iloc | Purely integer-location based indexing for selection by position. |
| loc | Access a group of rows and columns by label(s) or a boolean array. |

Consider the example below, which prints the attributes listed above for a pandas DataFrame:

python
# Initializing multiple lists
lst1 = ["Apple", "1kg", 300]
lst2 = ["Orange", "2kg", 150]
lst3 = ["Mango", "5kg", 800]

# Creating a pandas DataFrame
df = pd.DataFrame(
    data=[lst1, lst2, lst3],
    index=["First", "Second", "Third"],
    columns=["Fruit", "Weight", "Price"],
)

# Printing the DataFrame
print(df)

# Returns a list representing the axes of the DataFrame
print("Axes:", df.axes)

# Returns column labels of DataFrame
print("Columns:", df.columns)

# Returns the data type of columns in the DataFrame
print("Dtypes:\n", df.dtypes)

# The index (row labels) of the DataFrame
print("Index:", df.index)

# Returns a tuple representing the dimensionality of the DataFrame
print("Shape:", df.shape)

# Returns an int representing the number of elements in this object
print("Size:", df.size)

# Returns a Numpy representation of the DataFrame
print("Values:\n", df.values)

Now, let us also look at the iloc and loc attributes, which are used to select rows, columns, and individual values of the DataFrame.

iloc is used for purely integer-location based indexing, that is, selection by position. Note that positions in a pandas DataFrame start from 0, as they do in Python lists and pandas Series, and negative positions count from the end.

python
# Accessing the row at position 0
print("Value at index 0:\n", df.iloc[0])

# Accessing the row at position 2
print("Value at index 2:\n", df.iloc[2])

# Accessing the row at position -2 (the second-to-last row)
print("Value at index -2:\n", df.iloc[-2])

You can also slice a DataFrame by position (similar to a pandas Series). As with Python lists, the end position of the slice is excluded.

python
# Accessing values at index=0 to index=1
print("Values index 0 to 1:\n", df.iloc[0:2])

# Accessing values at index=0 to index=2 with a step of 2
print("Values index 0 to 2 step 2:\n", df.iloc[0:3:2])

Since a pandas DataFrame has two dimensions, you can select by column position as well:

python
# Accessing element at index=0 in row axis and index=0 in column axis
print("Element (0,0):", df.iloc[0, 0])

# Accessing all elements in index=0 in column axis
print("Column 0:", df.iloc[:, 0])

# Accessing all elements in index=-2 in column axis
print("Column -2:", df.iloc[:, -2])

# Accessing elements in index=0 to index=1 in row axis and index=0 to index=1 in column axis
print("Slice [0:2, 0:2]:\n", df.iloc[:2, :2])

Next, loc is used to access a group of rows and columns by label(s) or a boolean array:

python
# Accessing row with label = "First"
print(df.loc["First"])

# Accessing row with label = "First" and label = "Second"
print(df.loc[["First", "Second"]])

You can also access both rows and columns using the loc attribute:

python
# Accessing row axis label = "First" and column axis label = "Price"
print(df.loc["First", "Price"])

# Accessing multiple row and column axis labels
print(df.loc[["First", "Second"], ["Weight", "Price"]])
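
The examples above select by label only. As a minimal sketch of the boolean-array form of loc, the snippet below rebuilds the same fruit DataFrame and keeps only the rows whose price exceeds a threshold. The threshold of 200 and the names `mask` and `expensive` are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["Apple", "1kg", 300], ["Orange", "2kg", 150], ["Mango", "5kg", 800]],
    index=["First", "Second", "Third"],
    columns=["Fruit", "Weight", "Price"],
)

# A boolean Series with one True/False value per row
mask = df["Price"] > 200

# Keep only the rows where the mask is True
expensive = df.loc[mask]
print(expensive)

# A boolean row mask can be combined with column labels
print(df.loc[mask, ["Fruit", "Price"]])
```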

That's it for this lesson! Now you know what a pandas DataFrame is, how to create one, and what its core attributes are.