Mastering Logical Comparison, Control Flow, Filtering on Numpy Array and Pandas DataFrame

In this article, we will learn about different comparison operators, how to combine them with Boolean operators, and how to use Boolean outcomes in control structures. Boolean logic is the foundation of decision-making in Python programs. We'll also learn to filter data in pandas DataFrames using logic, a skill that a data scientist must have.

Comparison Operators

Comparison operators tell how two values relate to each other and evaluate to a boolean, either True or False.

Numeric comparisons

In the simplest case, we can use these operators on numbers. For example, to check if 2 is smaller than 3, we type 2 < 3.

print(2 < 3)
output:
True

Because 2 is less than 3, we get True. We can also check if two values are equal, with a double equals sign. The following call shows that 5 == 6 gives us False.

print(5 == 6)
output:
False

This makes sense because 5 is not equal to 6. We can also combine equality with smaller than. Have a look at this command, which checks if 5 is smaller than or equal to 6.

print(5 <= 6)
output:
True

It's True. Note that 6 smaller than or equal to 6 is also True, because the two values are equal.

print(6 <= 6)
output:
True

Of course, we can also use comparison operators directly on variables that represent these integers.

x = 5
y = 6
print(x < y)
output:
True

Comparison between strings

All these operators also work for strings. Let's check if "abc" is smaller than "acd".

print("abc" < "acd")
output:
True

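Python compares strings character by character in alphabetical order (strictly speaking, by Unicode code point). The first characters are equal and "b" comes before "c", so "abc" comes before "acd" and the result is True.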
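
As a small illustration of how string comparison behaves, the sketch below looks at the code points behind the characters and shows one consequence: uppercase letters sort before lowercase ones. The example words and variable names are made up for this illustration.

```python
# Strings are compared character by character using Unicode code points.
print(ord("b"), ord("c"))            # code points of the second characters

first = "abc" < "acd"                # 'a' == 'a', then 'b' < 'c'

# Uppercase letters have smaller code points than lowercase letters,
# so "Zebra" sorts before "apple".
upper_first = "Zebra" < "apple"

# One way to compare without regard to case: lowercase both sides first.
case_insensitive = "Zebra".lower() < "apple".lower()

print(first, upper_first, case_insensitive)
```

If case should not matter for your comparison, normalizing both strings first, as in the last comparison, is one possible approach.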

Comparison between integer and string

Let's find out if comparing a string and an integer works. Here we check if the integer 2 is smaller than the string "abc".

print(2 < "abc")
output:
TypeError: '<' not supported between instances of 'int' and 'str'

We get a TypeError. Typically, Python can't tell how two objects with different types relate.

Comparison between integer and float

Different numeric types, such as integers and floats, are an exception: Python can compare them directly.

print(3 < 4.12)
output:
True

No error this time. In general, though, make sure that the objects you compare have the same type.

Comparison on NumPy arrays

Another exception arises when we compare a NumPy array, lengths, with an integer, 22. This works perfectly.

import numpy as np
lengths = np.array([21.85, 20.97, 21.75, 24.74, 21.44])
print(type(lengths))
print(lengths > 22)
output:
<class 'numpy.ndarray'>
[False False False  True False]

NumPy figures out that we want to compare every element in lengths with 22 and returns the corresponding booleans. Conceptually, NumPy treats 22 as an array of the same size filled with the number 22 and then performs an element-wise comparison (this mechanism is called broadcasting). This is concise, very efficient code, which data scientists love!
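
To make the "array of the same size filled with 22" idea concrete, here is a minimal sketch that builds such an array by hand with np.full_like and checks that it gives the same result as comparing with the plain number. The names threshold, explicit and broadcast are only illustrative.

```python
import numpy as np

lengths = np.array([21.85, 20.97, 21.75, 24.74, 21.44])

# Explicit version of the idea: an array shaped like lengths, filled with 22.
threshold = np.full_like(lengths, 22)

explicit = lengths > threshold   # array compared with array
broadcast = lengths > 22         # array compared with a single number

# Both comparisons produce the same boolean array.
print(np.array_equal(explicit, broadcast))
```

In everyday code the short form lengths > 22 is the one to use; the explicit array is only shown here to illustrate what the comparison means.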

We can also compare two NumPy arrays element-wise. house1 and house2 contain the areas for the kitchen, living room, bedroom and bathroom in the same order. Which areas in house1 are smaller than the ones in house2? We can find out like this.

house1 = np.array([18.0, 20.0, 10.75, 9.50])
house2 = np.array([14.0, 24.0, 14.25, 9.0])
print(house1 < house2)
output:
[False  True  True False]

It appears that the living room and bedroom in house1 are smaller than the corresponding areas in house2.

Comparators

Here is the table that summarizes all comparison operators.

Comparator   Meaning
<            less than
<=           less than or equal to
>            greater than
>=           greater than or equal to
==           equal to
!=           not equal to

We are already familiar with some of these. They're all pretty straightforward, except perhaps for !=. The exclamation mark followed by an equals sign stands for inequality. It's the opposite of equality.

Equality

To check if two Python values, or variables, are equal you can use ==. To check for inequality, you need !=. Have a look at the following examples that all result in True.

print(2 == (1 + 1))
print("PYTHON" != "python")
print(True != False)
print("Python" != "python")
output:
True
True
True
True

Write code to check if True equals False.

# Comparison of booleans
print(True == False)
output:
False

Write Python code to check if -3 * 15 is not equal to 45.

# Comparison of integers
print(( -3 * 15 ) != 45)
output:
True

Ask Python whether the strings "python" and "Python" are equal.

# Comparison of strings
print("python" == "Python")
output:
False

Note that strings are case-sensitive. What happens if you compare booleans and integers? Write code to see if True and 1 are equal.

# Compare a boolean with an integer
print(True == 1)
print(True == 2)
output:
True
False

A boolean is a special kind of integer: True corresponds to 1, False corresponds to 0.
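
Because booleans behave like the integers 1 and 0, they can also be added up. The following sketch shows this and uses it to count how many values pass a test; the list values and the threshold 4 are arbitrary choices for illustration.

```python
# bool is a subclass of int, so True and False act like 1 and 0.
print(isinstance(True, int))
print(True + True)

# Summing booleans counts how many comparisons were True.
values = [3, 8, 12, 5]
count_large = sum(v > 4 for v in values)
print(count_large)
```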

Greater and less than

We also talked about the less than and greater than signs, < and >, in Python. We can combine them with an equals sign to get <= and >=. Note that =< and => are not valid. Here are some examples.

print(3 < 4)
print(3 <= 4)
print("alpha" <= "beta")
output:
True
True
True

Remember that for string comparison, Python determines the relationship based on alphabetical order.

Check if x is greater than or equal to -13.

# Comparison of integers
x = -4 * 3
print(x >= -13)
output:
True

Check if True is greater than False.

# Comparison of booleans
print(True > False)
output:
True

Remember that True is 1 and False is 0 in value.

Boolean Operators

We can produce booleans by performing comparison operations. The next step is combining these booleans. We can use boolean operators for this. The three most common ones are

  • and,

  • or, and

  • not.

and

The and operator works just as we would expect. It takes two booleans and returns True only if both of them are True.

Case1   Case2   Case1 and Case2
True    True    True
True    False   False
False   True    False
False   False   False

print(True and True)
print(True and False)
print(False and True)
print(False and False)
output:
True
False
False
False

Instead of using booleans directly, we can also use the results of comparisons. Suppose we have a variable x, equal to 8. To check if this variable is greater than 5 but less than 15, we can use x > 5 and x < 15.

x = 8
print(x > 5 and x < 15)
output:
True

As we already learned, the first part will evaluate to True. The second part will also evaluate to True. So the result of this expression, True and True, is True. This makes sense, because 8 lies between 5 and 15.
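
Python also allows comparisons to be chained, which gives a compact way to write a "between" test. The sketch below shows that the chained form gives the same result as the version with and; the variable names are only illustrative.

```python
x = 8

with_and = x > 5 and x < 15   # two comparisons joined by and
chained = 5 < x < 15          # the same test written as a chained comparison

print(with_and, chained)
```

Note that chaining works for plain Python values like this x; it is not a replacement for the element-wise array functions discussed later in the article.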

or

The or operator works similarly, but the difference is that only one of the booleans needs to be True for the result to be True.

Case1   Case2   Case1 or Case2
True    True    True
True    False   True
False   True    True
False   False   False

print(True or True)
print(True or False)
print(False or True)
print(False or False)
output:
True
True
True
False

Here, too, we can make combinations with variables, like this example that checks if a variable y, which is equal to 3, is less than 5 or greater than 10.

y = 3
print(y < 5 or y > 10)
output:
True

3 < 5 is True and 3 > 10 is False, so the or operation returns True.

not

Finally, let's look at the not operator. It simply negates the boolean value we use it on: not True is False, and not False is True. The not operation is typically useful if we're combining different boolean operations and then want to negate that result.

print(not True)
print(not False)
output:
False
True

Nested Boolean operators

Let's take the boolean operators to another level.
Note that not has a higher priority than and and or, so it is evaluated first.

x = 8
y = 9
print(not(not(x < 3) and not(y < 8 or y > 14)))
output:
False

x < 3 is False. y < 8 or y > 14 is False as well. If you continue working like this, simplifying from the inside outward, you'll end up with False.
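
The inside-out simplification can be written out step by step. This is just one way to break the expression apart, with illustrative names left, right and result; the last two lines show why parentheses matter given the priority of not.

```python
x = 8
y = 9

left = not (x < 3)                # not False -> True
right = not (y < 8 or y > 14)     # not (False or False) -> True
result = not (left and right)     # not (True and True) -> False
print(left, right, result)

# not binds more tightly than or, so these two expressions differ.
without_parens = not True or True     # read as (not True) or True
with_parens = not (True or True)
print(without_parens, with_parens)
```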

Filtering on NumPy arrays

Now, for NumPy arrays, things are different. Returning to the lengths example, we can try to find out which lengths are greater than 21 but less than 22. The result of lengths > 21 is easily found, and so is the one for lengths < 22.

print(lengths)
print(lengths > 21)
print(lengths < 22)
output:
[21.85 20.97 21.75 24.74 21.44]
[ True False  True  True  True]
[ True  True  True False  True]

Let's now try to combine those with the and operator we just learned.

print(lengths > 21 and lengths < 22)
output:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Oops, Python raises a ValueError: the truth value of an array with more than one element is ambiguous. The and operator needs a single True or False from each operand, and an array of several booleans has no single truth value.
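
The error message mentions a.any() and a.all(). These reduce a whole boolean array to a single boolean, which is useful when the question is "is any length above 21?" rather than "which lengths are above 21?". A minimal sketch, assuming the same lengths array as before:

```python
import numpy as np

lengths = np.array([21.85, 20.97, 21.75, 24.74, 21.44])
mask = lengths > 21

# Asking for the truth value of a multi-element array raises ValueError.
try:
    bool(mask)
except ValueError as err:
    print("ValueError:", err)

print(mask.any())   # True if at least one element is True
print(mask.all())   # True only if every element is True
```

These methods answer a different question than an element-wise and, so they do not solve the filtering problem at hand.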

NumPy provides "array equivalents" of the and, or and not operators as functions:

  • logical_and,

  • logical_or and

  • logical_not.

To find out which lengths are between 21 and 22, we will use these functions. Again, as we expect from NumPy, the and operation is performed element-wise.

print(np.logical_and(lengths > 21, lengths < 22))
output:
[ True False  True False  True]

To select only the lengths that are between 21 and 22, we can use the resulting array of booleans in square brackets.

print(lengths[np.logical_and(lengths > 21, lengths < 22)])
output:
[21.85 21.75 21.44]
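
NumPy boolean arrays also support the & operator as an element-wise and. The sketch below compares it with np.logical_and on the same data. The parentheses around each comparison are needed because & binds more tightly than < and >.

```python
import numpy as np

lengths = np.array([21.85, 20.97, 21.75, 24.74, 21.44])

with_function = np.logical_and(lengths > 21, lengths < 22)
with_operator = (lengths > 21) & (lengths < 22)   # parentheses are required

# Both forms give the same boolean array and select the same values.
print(np.array_equal(with_function, with_operator))
print(lengths[with_operator])
```

Which form to use is largely a matter of taste; the function form avoids the parentheses pitfall.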

Again, NumPy wins when it comes to writing short yet very expressive Python code. How does this work on pandas DataFrames, the de facto standard for dataset manipulation?

Boolean operators on NumPy Array

Earlier, comparison operators like < and >= worked with NumPy arrays out of the box. Unfortunately, this is not true for the boolean operators and, or, and not.
To apply these operations to NumPy arrays, we need np.logical_and(), np.logical_or() and np.logical_not(). Here's an example on the house1 and house2 arrays.

Generate boolean arrays that answer the following questions:

  • Which areas in house1 are greater than 18.5 or smaller than 10?

# house1 greater than 18.5 or smaller than 10
print(np.logical_or(house1 > 18.5, house1 < 10))
output:
[False  True False  True]

  • Which areas are smaller than 11 in both house1 and house2?

# Both house1 and house2 smaller than 11
print(np.logical_and(house1 < 11, house2 < 11))
output:
[False False False  True]
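
The third function, np.logical_not, has not appeared in an example yet. Here is a small sketch on house1 that negates a comparison and checks the result against two equivalent formulations; the names small and not_small are only illustrative.

```python
import numpy as np

house1 = np.array([18.0, 20.0, 10.75, 9.50])

small = house1 < 11                  # areas smaller than 11
not_small = np.logical_not(small)    # element-wise negation
print(not_small)

# For boolean arrays, the ~ operator gives the same result,
# and so does writing the opposite comparison directly.
print(np.array_equal(not_small, ~small))
print(np.array_equal(not_small, house1 >= 11))
```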

Filtering on pandas DataFrames

NumPy arrays are useful for comparison and boolean operations on an element-wise basis. Let's now use this knowledge on a pandas DataFrame. The examples below use the countries.csv file. First, let's import the countries dataset from the CSV file using pandas.

import pandas as pd
countries = pd.read_csv('countries.csv', index_col=0)
print(countries)
output:
       country    capital  population
IND      India  New Delhi  1393409030
MMR    Myanmar     Yangon    54806010
THA   Thailand    Bangkok    69950840
SGP  Singapore  Singapore     5453570
CHN      China    Beijing  1412360000
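
If countries.csv is not at hand, the same DataFrame can be rebuilt in memory from the values printed above. This is only a stand-in for the file so that the later examples can be reproduced.

```python
import pandas as pd

# Same rows and columns as the printed countries table,
# with the three-letter codes as the index.
countries = pd.DataFrame(
    {
        "country": ["India", "Myanmar", "Thailand", "Singapore", "China"],
        "capital": ["New Delhi", "Yangon", "Bangkok", "Singapore", "Beijing"],
        "population": [1393409030, 54806010, 69950840, 5453570, 1412360000],
    },
    index=["IND", "MMR", "THA", "SGP", "CHN"],
)
print(countries)
```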

Suppose we now want to keep the countries for which the population is greater than 100,000,000. There are three steps to this.

  1. First of all, we want to get the population column from countries.

  2. Next, we perform the comparison on this column and store its result.

  3. Finally, we should use this result to do the appropriate selection on the DataFrame.

Step 1: Get the column

The first step is getting the population column from countries. There are many different ways to do this. What's important here is that we ideally get a pandas Series, not a pandas DataFrame. Let's do this with square brackets, like this.

print(type(countries['population']))
print(countries['population'])
output:
<class 'pandas.core.series.Series'>
IND    1393409030
MMR      54806010
THA      69950840
SGP       5453570
CHN    1412360000
Name: population, dtype: int64

The following loc and iloc alternatives would also work perfectly fine.

print(countries.loc[:, 'population'])
output:
IND    1393409030
MMR      54806010
THA      69950840
SGP       5453570
CHN    1412360000
Name: population, dtype: int64

print(countries.iloc[:, 2])
output:
IND    1393409030
MMR      54806010
THA      69950840
SGP       5453570
CHN    1412360000
Name: population, dtype: int64
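
The difference between getting a Series and getting a DataFrame comes down to the brackets. As a quick sketch, using a trimmed in-memory copy of the table so the snippet runs on its own, single brackets return a Series while double brackets return a one-column DataFrame:

```python
import pandas as pd

# Trimmed stand-in for the countries table (two columns only).
countries = pd.DataFrame(
    {
        "country": ["India", "Myanmar", "Thailand", "Singapore", "China"],
        "population": [1393409030, 54806010, 69950840, 5453570, 1412360000],
    },
    index=["IND", "MMR", "THA", "SGP", "CHN"],
)

as_series = countries["population"]     # single brackets: Series
as_frame = countries[["population"]]    # double brackets: DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```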

Step 2: Compare

Next, we perform the comparison. To see which rows have a population greater than 100,000,000, we simply append > 100000000 to the code from before, like this.

print(countries['population'] > 100000000)
output:
IND     True
MMR    False
THA    False
SGP    False
CHN     True
Name: population, dtype: bool

Now we get a Series containing booleans. If you compare it to the population values, you can see that the rows with a population over 100000000 correspond to True, and the ones with a population under 100000000 correspond to False. Let's store this boolean Series as is_huge.

is_huge = countries['population'] > 100000000
print(is_huge)
output:
IND     True
MMR    False
THA    False
SGP    False
CHN     True
Name: population, dtype: bool

Step 3: Subset the DataFrame

The final step is using this boolean Series is_huge to subset the Pandas DataFrame. To do this, we put is_huge inside square brackets.

print(countries[is_huge])
output:
    country    capital  population
IND   India  New Delhi  1393409030
CHN   China    Beijing  1412360000

The result is exactly what we want: only the countries with a population greater than 100000000, namely India and China.

Summary

So let's summarize this: we selected the population column, performed a comparison on it, and stored the result as is_huge so that we can use it to index the countries DataFrame. These separate commands do the trick. However, we can also write this in one line: simply put the code that defines is_huge directly in the square brackets.

print(countries[countries['population'] > 100000000])
output:
    country    capital  population
IND   India  New Delhi  1393409030
CHN   China    Beijing  1412360000

Great! Pandas makes a data scientist's life much easier.
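
A boolean Series can also be passed to loc, which makes it possible to filter rows and pick a column in the same step. The sketch below rebuilds the table in memory and lists the capitals of the countries flagged by is_huge; huge_capitals is just an illustrative name.

```python
import pandas as pd

# In-memory stand-in for countries.csv, using the values printed earlier.
countries = pd.DataFrame(
    {
        "country": ["India", "Myanmar", "Thailand", "Singapore", "China"],
        "capital": ["New Delhi", "Yangon", "Bangkok", "Singapore", "Beijing"],
        "population": [1393409030, 54806010, 69950840, 5453570, 1412360000],
    },
    index=["IND", "MMR", "THA", "SGP", "CHN"],
)

is_huge = countries["population"] > 100000000

# Rows where is_huge is True, restricted to the capital column.
huge_capitals = countries.loc[is_huge, "capital"]
print(huge_capitals.tolist())
```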

Boolean operators on Pandas DataFrame

So far we haven't used boolean operators on a DataFrame. Remember the logical_and function from the NumPy package that we used to do an element-wise boolean operation on NumPy arrays? Because pandas is built on NumPy, we can also use that function here. Let's write code that keeps the observations that have a population between 10,000,000 and 90,000,000.

print(countries)
output:
       country    capital  population
IND      India  New Delhi  1393409030
MMR    Myanmar     Yangon    54806010
THA   Thailand    Bangkok    69950840
SGP  Singapore  Singapore     5453570
CHN      China    Beijing  1412360000

print(np.logical_and(countries['population'] > 10000000, countries['population'] < 90000000))
output:
IND    False
MMR     True
THA     True
SGP    False
CHN    False
Name: population, dtype: bool

The only thing left to do is placing this code inside square brackets to subset countries appropriately. This time, only Myanmar and Thailand are included. Look how easy it is to filter DataFrames to get interesting results.

print(countries[np.logical_and(countries['population'] > 10000000, countries['population'] < 90000000)])
output:
      country  capital  population
MMR   Myanmar   Yangon    54806010
THA  Thailand  Bangkok    69950840
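
Boolean pandas Series also support the &, | and ~ operators for element-wise and, or and not. As a sketch of the same filter written with &, again on an in-memory copy of the table (pop and medium are illustrative names):

```python
import pandas as pd

# In-memory stand-in for countries.csv, using the values printed earlier.
countries = pd.DataFrame(
    {
        "country": ["India", "Myanmar", "Thailand", "Singapore", "China"],
        "capital": ["New Delhi", "Yangon", "Bangkok", "Singapore", "Beijing"],
        "population": [1393409030, 54806010, 69950840, 5453570, 1412360000],
    },
    index=["IND", "MMR", "THA", "SGP", "CHN"],
)

pop = countries["population"]

# Each comparison needs its own parentheses when combined with &.
medium = (pop > 10000000) & (pop < 90000000)
print(countries[medium].index.tolist())

# ~ negates the mask: every country outside that range.
print(countries[~medium].index.tolist())
```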

Now we know about comparison operators such as

  • <

  • <=

  • >

  • >=

  • ==

  • !=

and we also know how to combine the boolean results, using boolean operators such as

  • and,

  • or and

  • not.

Control Flow

Things get interesting when we can use these concepts to change how our program behaves. Depending on the outcome of our comparisons, we might want our Python code to behave differently. We can do this with conditional statements in Python:

  • if,

  • else and

  • elif.

if

Suppose we have a variable x, equal to 4. If the value is even, we want to print out: "x is even".

x = 4
if x % 2 == 0:
    print('x is even.')
output:
x is even.

The modulo operator % with 2 will return 0 if x is even. Python checks if the condition holds. It's true, so the corresponding code is executed: "x is even" gets printed out.

Let's compare this to the general recipe for an if statement. It reads as follows: if the condition is True, execute the indented code.
Notice the colon at the end, and the fact that we simply have to indent the Python code with four spaces (or a tab) to tell Python what to do in case the condition holds. To exit the if statement, simply continue with some Python code without indentation, and Python will know that it's not part of the if statement. It's perfectly possible to have more lines inside the if statement, like this for example.

x = 4
if x % 2 == 0:
    print('Checking if x (', x, ') is divisible by 2...')
    print('x is even.')
output:
Checking if x ( 4 ) is divisible by 2...
x is even.

The script now prints out two lines if we run it. If the condition does not hold, the indented code is not executed. You can see this if we change x to be 3 and rerun the code.

x = 3
if x % 2 == 0:
    print('Checking if x (', x, ') is divisible by 2...')
    print('x is even.')
output:

There's no output. Suppose now that we want to print out "x is odd" in this case. How can we do this?

else

Well, we can simply use an else statement, like this.

x = 3
if x % 2 == 0:
    print('x is even.')
else:
    print('x is odd.')
output:
x is odd.

If we run it with x equal to 3, the condition is not true, so the code under the else statement runs and "x is odd." gets printed out. The general recipe looks like this: for the else statement, we don't need to specify a condition. The code under else runs whenever the condition of the if statement does not hold.
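
To try both branches without editing x by hand, the if/else can be wrapped in a small function. The function name describe_parity is made up for this sketch and is not part of the original example.

```python
def describe_parity(x):
    """Return a short message saying whether x is even or odd."""
    if x % 2 == 0:
        return "x is even."
    else:
        return "x is odd."


print(describe_parity(4))
print(describe_parity(3))
```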

elif

We can think of cases where even more customized behavior is necessary. Say we want different printouts for numbers that are divisible by 2, numbers that are divisible by 3, and all other numbers. We can use elif to get the job done. Here is an example.

x = 3
if x % 2 == 0: # False
    print('x is divisible by 2.')
elif x % 3 == 0: # True
    print('x is divisible by 3.')
else:
    print('x is divisible by neither 2 nor 3.')
output:
x is divisible by 3.

If x equals 3, the first condition is False, so it goes over to check the next condition. This condition holds True so the corresponding print statement is executed.

Suppose now that x equals 6. Both the if and elif conditions hold True in this case. Will two printouts occur?

x = 6
if x % 2 == 0: # True
    print('x is divisible by 2.')
elif x % 3 == 0: # never reach here
    print('x is divisible by 3.')
else:
    print('x is divisible by neither 2 nor 3.')
output:
x is divisible by 2.

Nope. As soon as Python finds a true condition, it executes the corresponding code and then leaves the whole control structure. This means the second condition, the one belonging to the elif, is never reached, so there's no corresponding printout. Control flow can be extremely powerful when we're writing Python scripts.
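
If both printouts are wanted for a number like 6, one option is to use separate if statements instead of an if/elif chain, since each if is checked independently. The sketch below collects the messages in a list (an illustrative choice) so the fallback message can be added only when nothing matched.

```python
x = 6
messages = []

# Two independent if statements: both conditions are always checked.
if x % 2 == 0:
    messages.append("x is divisible by 2.")
if x % 3 == 0:
    messages.append("x is divisible by 3.")

# Fallback when neither condition was true.
if not messages:
    messages.append("x is divisible by neither 2 nor 3.")

print("\n".join(messages))
```

The if/elif chain is the right tool when exactly one branch should run; independent if statements fit when several conditions may apply at once.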

Conclusion

In this article, we learned logical comparison, control flow, and filtering on Numpy Array and Pandas DataFrame.


#python #pandas #numpy #datascience #logical-comparison #control-flow #filtering