In this Pandas tutorial, we will learn how to use Pandas to randomly select rows and columns from a dataframe. There are some reasons for randomly sampling our data; for instance, we may have a very large dataset and want to build our models on a smaller sample. Other examples are when carrying out bootstrapping or cross-validation. Here, we will learn how to select rows at random, set a random seed, and sample by group, using weights and conditions, among other useful things.
Table of Contents
- Example Data
- How to Take a Random Sample of Rows
- How to Sample Pandas Dataframe using frac
- How to Shuffle Pandas Dataframe using Numpy
- Pandas Sample with Replacement
- Sample Dataframe with Seed
- Pandas Sample with Weights
- Pandas Sample of Rows by Group
- Pandas Random Sample with Condition
- Using Pandas Sample and Remove Random Rows
- Saving the Pandas Sample
- Summary
Example Data
In this section, we will learn how to take a random sample of a Pandas dataframe. We are going to use an Excel file that can be downloaded here. First, we start by importing Pandas, and we use read_excel to load the Excel file into a dataframe:
import pandas as pd
# reading data from an Excel file:
df = pd.read_excel('MLBPlayerSalaries.xlsx')
# Displaying the first 5 rows:
df.head()
Code language: Python (python)
We use the method shape to see how many rows and columns are in our dataframe. This method is very similar to the dim function in the R statistical programming language (see here).
# Get number of rows and columns:
df.shape
Code language: CSS (css)
- Read the Pandas Excel Tutorial to learn more about loading Excel files into Pandas dataframes.
Note that you can convert a dictionary to a data frame. Finally, before we take random samples of rows, it may be useful to change the column names in the Pandas data frame—if needed, of course.
The easiest way to randomly select rows from a Pandas dataframe is to use the sample() method. For example, if your dataframe is called “df”, df.sample(n=250)
will result in that 200 rows being selected randomly. Removing the n
parameter will result in one random row instead of multiple rows.
How to Take a Random Sample of Rows
Now that we know how many rows and columns there are (19543 and 5 rows and columns, respectively), we will continue using the Pandas sample. In the example below, we are not going to use any parameters.
Sample One Row Randomly
The default behavior, when not using any parameters, is sampling one row:
# Take one random sample of the rows
df.sample()
Code language: Python (python)
Random Sample of Rows from Pandas Dataframe:
We usually want to take random samples of more rows than one. Thus, we will take a random sample of 200 in the following Pandas sample example. We are going to use the parameter n
to accomplish this. Here’s how to randomly sample 200 rows from the dataframe:
# Random sample of 200 rows
df.sample(n=200).head(10)
Code language: Python (python)
As seen in the above image, we also used the head
method to print only the 10 first rows of the randomly sampled rows. In most cases, we may want to save the randomly sampled rows. To accomplish this, we will create a new dataframe:
# Saving the randomly sampled rows:
df200 = df.sample(n=200)
df200.shape
# Output: (200, 5)
Code language: Python (python)
In the code above we created a new dataframe, called df200, with 200 randomly selected rows. Again, we used the method shape to see how many rows (and columns) we now have.
Random Sampling Rows using NumPy Choice
Of course, it is very easy and convenient to use Pandas’ sample method to take a random sample of rows. Note, however, that it’s possible to use NumPy and random.choice. In the example below, we will get the same result as above by using np.random.choice.
As usual, when working with Python modules, we import NumPy. After this is done, we will continue to create an array of indices (rows) and then use Pandas loc
method to select the rows based on the random indices:
import numpy as np
rows = np.random.choice(df.index.values, 200)
df200 = df.loc[rows]
df200.head()
Code language: JavaScript (javascript)
How to Sample Pandas Dataframe using frac
Now that we have used NumPy, we will continue this Pandas dataframe sample tutorial by using the sample method’s frac
parameter. This parameter specifies the fraction (percentage) of rows to return in the random sample. Setting frac to 1 (frac=1
) will return all rows in random order. If we just want to shuffle the data frame, we can do so using sample
and the parameter frac.
# Pandas random sample of rows based on fraction:
df.sample(frac=1).head()
Code language: Python (python)
As can be seen in the output table above, the rows are now ordered randomly. We can use shape
, again, to know that we have the same amount of rows:
df.sample(frac=1).shape
# Output: (19543, 5)
Code language: Python (python)
As expected, there are as many rows and columns as in the original dataframe.
Randomly Select a Fraction of the Rows
We can use frac to get 200 randomly selected rows, for example. Before doing this, we will need to calculate the percentage of 200 of our total number of rows. In this case, it’s approximately 1% of the data, and using the code below will also give us 200 random rows from the data frame.
# Getting roughly 1% of the dataframe:
df200 = df.sample(frac=.01023)
Code language: Python (python)
Note that the frac
parameter cannot be used together with n. We will get a ValueError
that states we cannot enter a value for both frac
and n
.
How to Shuffle Pandas Dataframe using Numpy
Here, we will use another method to shuffle the dataframe. In the example code below, we will use the Python module NumPy again. We have to use reindex
(Pandas) and random.permutation
(NumPy). More specifically, we will permute the dataframe using the indices:
df_shuffled = df.reindex(np.random.permutation(df.index))
Code language: Python (python)
Pandas Sample with Replacement
We can also, of course, sample with replacement. By default Pandas sample
will sample without replacement. In some cases we have to sample with replacement (e.g., with really large datasets). If we want to sample with replacement, we should use the replace parameter:
# Pandas random sample of 5 rows with replacement:
df5 = df.sample(n=5, replace=True)
Code language: Python (python)
Of course, we can use both the parameters frac and random_state, or n and random_state, together. In the example below, we randomly select 50% of the rows and use the random_state
. It is also possible to use the replace=True
parameter together with frac and random_state to get a reproducible percentage of rows with replacement.
Sample Dataframe with Seed
If we want to reproduce our random sample of rows, we can use the random_state
parameter. This is the seed for the random number generator, and we need to input an integer:
df200 = df.sample(n=200, random_state=1111)
Code language: Python (python)
Of course, we can use both the parameters frac
and random_state
, or n
and random_state
, together. In the example below, we randomly select 50% of the rows and use the random_state. It is also possible to use the replace=True
parameter together with frac and random_state
to get a reproducible percentage of rows with replacement.
df200 = df.sample(frac=.5, replace=True, random_state=1111)
Code language: Python (python)
Pandas Sample with Weights
The sample method also has parameter weights
, which can be used to increase the probability of sampling certain rows. We start the next Pandas sample example by importing NumPy.
import numpy as np
df['Weights'] = np.where(df['Year'] <= 2000, .75, .25)
df['Weights'].unique()
# Output: array([0.75 , 0.25])
Code language: Python (python)
In the code above we used NumPy’s where to create a new column ‘Weights’. Up until the year 2000 the weights are .5. This will increase the probability for Pandas sample
to select rows up until this year:
df2 = df.sample(frac=.5, random_state=1111, weights='Weights')
df2.shape
# Output: (9772, 6)
Code language: Python (python)
Pandas Sample of Rows by Group
It’s also possible to sample each group after using the Pandas groupby
method. In the example below,
we are going to group the dataframe by player and then take 2 samples of data from each player:
grouped = df.groupby('Player')
grouped.apply(lambda x: x.sample(n=2, replace=True)).head()
Code language: Python (python)
The code above may need some clarification. In the second line, we used Pandas’ apply method and the anonymous Python function lambda
. It will run a sample on each subset (i.e., for each Player) and take 2 random rows. Note that here, we have to use replace=True
, or else it won’t work. If you want to learn more about working with Pandas’ groupby method, see the Pandas Groupby Tutorial.
Pandas Random Sample with Condition
Say that we want to take a random sample of players with a salary under 421000 (or rows when the salary is under this number). For some players, this could be certain years. This is quite easy. In the example below, we sample 10% of the data frame based on this condition.
df[df['Salary'] < 421000].sample(frac=.1).head()
Code language: Python (python)
It’s also possible to have more than one condition. We just have to add some code to the above example. Now we are going to sample salaries under 421000 and prior to the year 2000:
df[(df['Salary'] < 421000) & (df['Year'] < 2000)].sample(frac=.1)
Code language: Python (python)
Using Pandas Sample and Remove Random Rows
We may want to take a random sample from our dataframe and remove those rows. Or, we may want to create two different dataframes: one with 80% of the rows and one with the remaining 20%. Of course, both of these things can be done using the sample and drop methods. In the code example below, we create two new dataframes: one with 80% of the rows and one with the remaining 20%.
df1 = df.sample(frac=0.8, random_state=138)
df2 = df.drop(df1.index)
Code language: Python (python)
If we merely want to remove random rows, we can use drop and the inplace
parameter:
df.drop(df1.index, inplace=True)
df.shape
# Same as: df.drop(df.sample(frac=0.8, random_state=138).index,
inplace=True)
# Output: (3909, 5)
Code language: Python (python)
More useful Pandas guides:
Saving the Pandas Sample
Finally, we may also want to save the data to work on later. In the example code below, we will save a Pandas sample to csv. To accomplish this, we use the to_csv
method. The first parameter is the filename, and because we don’t want an index column in the file, we use index_col=False.
import pandas as pd
df = pd.read_excel('MLBPlayerSalaries.xlsx')
df.sample(200, random_state=1111).to_csv('MBPlayerSalaries200Sample.csv',
index_col=False)
Code language: Python (python)
Summary
In this brief Pandas tutorial, we learned how to use the sample method. More specifically, we have learned how to:
- take a random sample of a data using the n (a number of rows) and frac (a percentage of rows) parameters,
- get reproducible results using a seed (random_state),
- sample by group, sample using weights, and sample with conditions
- create two samples and delete random rows
- saving the Pandas sample
That was it! Now we should know how to use Pandas sample.