Python: Pandas DataFrame data manipulation

joushe info
2022-04-12
0

...

We are in an era where a large amount of data is handled in any branch, be it in engineering, medicine, or business. For this reason, there is a need to manipulate data, either to import the initial data or to extract the final result. Python helps you automate the respective data changes in an efficient way. In Python, there are two powerful libraries for this purpose which will be discussed below.

We are in an era where a large amount of data is handled in any branch, be it in engineering, medicine, or business. For this reason, there is a need to manipulate data, either to import the initial data or to extract the final result. Python helps you automate the respective data changes in an efficient way. In Python, there are two powerful libraries for this purpose which will be discussed below.

The library pandas allows visualizing a vast amount of data in a very nice way. Data are arranged as in an Excel sheet. For this exercise, we will use the text file named 'vaccinations.txt' from the previous exercise. Let's get started!

Just to remind you, in the previous tutorial (please go to Python: Downloading data from the web), we downloaded a file from the web, reading line by line, and finally, created a text file named 'vaccinations.txt', containing a vast amount of data about covid vaccination in the world. This file will be used in this exercise. A very short part of it looks like this:

As seen from the figure above, the columns are separated by the 'comma' (',') delimiter. Even though the text file has a header (the first row), we will write our own row header with different names. Then, first of all, the text file should be imported into Python using the Python function pd.read_csv. Do not forget to import the pandas library.

#Importing library
import pandas as pd


#Importing the text file df = pd.read_csv('vaccinations.txt', header = None, skiprows = (1), sep = ',', quotechar = None, quoting = 3)
#Printing the DataFrame print(df)

Please note that when importing the pandas library, we give it the nickname 'pd'. This helps to make the code shorter every time we call it. If you do not give a nickname to the imported library, then you should write:

df = pandas.read_csv('vaccinations.txt', header = None, skiprows = (1), sep = ',', quotechar = None, quoting = 3).

Since the file has a lot of rows, we can display either the first five or the last five rows by typing:

#Getting the 5 first rows of the DataFrame
df.head()


#Getting the 5 last rows of the DataFrame df.tail()

Now, we will create our own header.

#Setting a header for the DataFrame
df.columns = ['Country','Country iso code','Date','Total vaccinations','People vaccinated','People fully vaccinated','Total boosters','Daily vaccinations raw','Daily vaccinations','Total vaccinations per hundred','People vaccinated per hundred','People fully vaccinated per hundred','Total boosters per hundred','Daily vaccinations per million','Daily people vaccinated','Daily people vaccinated per hundred']

We can also make a copy of the DataFrame in order not to modify the original one.

#Copying the DataFrame to another DataFrame
df_copy = df.copy()

Now, it is time to show you how we can remove one or several columns. For this, we will use 2 methods: the first one with the function del, and the second one using the function drop. If using the second method, we can delete the columns by either axis or column parameter.

#del function for removing one column only
del df_copy['Daily vaccinations']


#Printing the DataFrame print(df_copy)

#drop() function for removing one or more columns


#Method 1: Using axis parameter df_copy_drop1 = df_copy.drop(['Total vaccinations','People vaccinated','Country iso code'], axis=1) print(df_copy_drop1)
#Method 2: Using columns parameter df_copy_drop2 = df_copy.drop(columns=['Total vaccinations','People vaccinated','Country iso code']) print(df_copy_drop2)

Please notice that the original DataFrame df has 16 columns. When using the del function, the DataFrame df_copy has 15 columns. If using the drop function either by axis or column parameter, we get the same output: a DataFrame with 12 columns.

In a similar way, it is also possible to delete one or more rows from the DataFrame. For this, we will use the function drop.

#Deleting one or several rows
df_copy_drop = df_copy.drop([100, 525, 9461], axis=0)


Printing the DataFrame print(df_copy_drop)

Please notice that the DataFrame df_copy_drop has 3 rows less (94262 rows) than the DataFrame df_copy (94265). It is also important to keep in mind that when we use the function drop, axis=0 stands for removing rows, while axis=1 stands for columns.

Now, we will get certain rows from the original DataFrame df according to a certain condition. For example, we can get the data per country. For this, we will use the input function.

#Getting data per country
country = input('Please type your country of interest:')
df_country = df[df['Country'] == country]


#Printing the DataFrame print(df_country)

As seen, once the user enters the desired country, Python automatically outputs the data for this specific country. Note that the country name to be inputted should be in the text file, otherwise, you will get an error, because Python will not find any row with your input.

Just as Python can import a file, Python is also capable of exporting the DataFrame to an Excel file. To finalize this exercise, I will show you how this can be done.

#Exporting DataFrame to Excel
excel_file = df_country.to_excel('Vaccinations by country.xlsx')

If you go to the folder where your Python script is saved, then the exported Excel file can be found.

The final Python code will look like this:

#Importing library
import pandas as pd


#Importing the text file df = pd.read_csv('vaccinations.txt', header = None, skiprows = (1), sep = ',', quotechar = None, quoting = 3) print(df)
#Getting the 5 first rows of the DataFrame df.head()
#Getting the 5 last rows of the DataFrame df.tail()
#Setting a header for the DataFrame df.columns = ['Country','Country iso code','Date','Total vaccinations','People vaccinated','People fully vaccinated','Total boosters','Daily vaccinations raw','Daily vaccinations','Total vaccinations per hundred','People vaccinated per hundred','People fully vaccinated per hundred','Total boosters per hundred','Daily vaccinations per million','Daily people vaccinated','Daily people vaccinated per hundred'] print(df)
#Copying the DataFrame to another DataFrame df_copy = df.copy() print(df_copy)
#del function for removing one column only del df_copy['Daily vaccinations'] print(df_copy)
#drop() function for removing one or more columns
#Method 1: Using axis parameter df_copy_drop1 = df_copy.drop(['Total vaccinations','People vaccinated','Country iso code'], axis=1) print(df_copy_drop1)
#Method 2: Using columns parameter df_copy_drop2 = df_copy.drop(columns=['Total vaccinations','People vaccinated','Country iso code']) print(df_copy_drop2)
#Deleting one or several rows df_copy_drop = df_copy.drop([100, 525, 9461], axis=0) print(df_copy_drop)
#Getting data per country country = input('Please type your country of interest:') df_country = df[df['Country'] == country] print(df_country)
#Exporting DataFrame to Excel excel_file = df_country.to_excel('Vaccinations by country.xlsx')

Congratulations! You just learned how to manipulate a vast amount of data! To download the complete code and the text file containing the data used in this tutorial, please click here.

Views: 1

joushe info

Looking for new horizons

Looking for new horizons

Notifications

Receive the new articles in your email

2022 © nepy

One single breath

Configure

Choose your own style

Color

Choose the perfect color mode for you


Navigation Position

Select a suitable navigation system


Vertical Navbar Style

Switch between styles for your vertical navbar

Customize