Mastering Logical Comparison, Control Flow, and Filtering on NumPy Arrays and pandas DataFrames
In this article, we will learn about different comparison operators, how to combine them with Boolean operators, and how to use Boolean outcomes in control structures. Boolean logic is the foundation of decision-making in Python programs. We'll also learn to filter data in pandas DataFrames using logic, a skill that a data scientist must have.
Comparison Operators
Comparison operators tell us how two values relate to each other, and they return a boolean.
Numeric comparisons
In the simplest case, we can use these operators on numbers. For example, to check whether 2 is smaller than 3, we type 2 < 3.
print(2 < 3)
output:
True
Because 2 is less than 3, we get True. We can also check whether two values are equal, with a double equals sign. Checking whether 5 equals 6 gives us False.
print(5 == 6)
output:
False
This makes sense because 5 is not equal to 6. We can also combine equality with smaller than. The following command checks whether 5 is smaller than or equal to 6.
print(5 <= 6)
output:
True
The result is True, and 6 smaller than or equal to 6 is also True.
print(6 <= 6)
output:
True
Of course, we can also use comparison operators directly on variables that represent these integers.
x = 5
y = 6
print(x < y)
output:
True
Comparison between strings
All these operators also work for strings. Let's check if "abc" is smaller than "acd".
print("abc" < "acd")
output:
True
In alphabetical order, "abc" comes before "acd", so the result is True.
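To make the ordering rule more concrete, here is a small sketch. Python compares strings character by character using Unicode code points, which matches alphabetical order only when the letter case is the same. The example words are hypothetical and chosen only for illustration.

```python
# Strings are compared character by character using Unicode code points.
first = "abc" < "acd"          # 'a' == 'a', then 'b' < 'c' decides the result
print(first)

# ord() shows the code points behind the comparison.
print(ord("b"), ord("c"))

# Uppercase letters have smaller code points than lowercase letters,
# so the result can differ from dictionary order.
upper_first = "Zebra" < "apple"
print(upper_first)

# Lowercasing both sides gives a case-insensitive comparison.
case_insensitive = "Zebra".lower() < "apple".lower()
print(case_insensitive)
```

If case should not matter, normalizing both strings with lower() before comparing is one simple option.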
Comparison between integer and string
Let's find out whether comparing a string and an integer works. Here we check if the integer 2 is smaller than the string "abc".
print(2 < "abc")
output:
TypeError: '<' not supported between instances of 'int' and 'str'
We get a TypeError. In general, Python can't tell how two objects of different types relate when we order them with < or >.
Comparison between integer and float
Different numeric types, such as floats and integers, are exceptions.
print(3 < 4.12)
output:
True
No error this time. In general, always make sure that we make comparisons between objects of the same type.
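When the types might differ, one option is to catch the error or to convert explicitly before comparing. The sketch below assumes two hypothetical values, number and text, that are not part of the examples above.

```python
number = 2      # hypothetical example values
text = "10"

# Ordering an int against a str raises TypeError, which we can catch.
try:
    outcome = number < text
except TypeError as err:
    outcome = None
    print("Cannot compare:", err)

# Equality between different types does not raise; it is simply False.
equal = number == text

# Converting explicitly makes the intent clear.
as_numbers = number < int(text)     # compares 2 with 10
as_strings = str(number) < text     # compares "2" with "10", character by character
print(equal, as_numbers, as_strings)
```

Note that the two conversions disagree: as numbers 2 is smaller than 10, but as strings "2" comes after "10" because the character "2" sorts after "1".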
Comparison on NumPy arrays
Another exception arises when we compare a NumPy array, lengths, with an integer, 22. This works perfectly.
import numpy as np
lengths = np.array([21.85, 20.97, 21.75, 24.74, 21.44])
print(type(lengths))
print(lengths > 22)
output:
<class 'numpy.ndarray'>
[False False False True False]
NumPy figures out that we want to compare every element in lengths with 22, and returns the corresponding booleans. Conceptually, NumPy treats 22 as an array of the same size filled with the number 22 and then performs an element-wise comparison. This mechanism is called broadcasting. It gives concise, very efficient code, which data scientists love!
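As a quick check of this idea, the sketch below compares lengths against the scalar 22 and against an explicit array of 22s built with np.full_like. The name threshold is introduced here only for illustration.

```python
import numpy as np

lengths = np.array([21.85, 20.97, 21.75, 24.74, 21.44])

# Comparing against a scalar...
scalar_result = lengths > 22

# ...gives the same answer as comparing against an explicit array of 22s.
threshold = np.full_like(lengths, 22)
array_result = lengths > threshold

print(threshold)
print(np.array_equal(scalar_result, array_result))
```

The scalar form is preferable in practice because it is shorter and avoids building the extra array by hand.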
We can also compare two NumPy arrays element-wise. house1 and house2 contain the areas for the kitchen, living room, bedroom and bathroom in the same order. Which areas in house1 are smaller than the ones in house2? We can find out like this.
house1 = np.array([18.0, 20.0, 10.75, 9.50])
house2 = np.array([14.0, 24.0, 14.25, 9.0])
print(house1 < house2)
output:
[False True True False]
It appears that the living room and bedroom in house1 are smaller than the corresponding areas in house2.
Comparators
Here is the table that summarizes all comparison operators.
Comparator | Meaning |
< | less than |
<= | less than or equal to |
\> | greater than |
\>= | greater than or equal to |
\== | equal to |
!= | not equal to |
We are already familiar with some of these. They're all pretty straightforward, except perhaps for not equal, !=. The exclamation mark followed by an equals sign stands for inequality. It's the opposite of equality.
Equality
To check if two Python values, or variables, are equal you can use ==. To check for inequality, you need !=. Have a look at the following examples that all result in True.
print(2 == (1 + 1))
print("PYTHON" != "python")
print(True != False)
print("Python" != "python")
output:
True
True
True
True
Write code to see if True equals False.
# Comparison of booleans
print(True == False)
output:
False
Write Python code to check if -3 * 15 is not equal to 45.
# Comparison of integers
print(( -3 * 15 ) != 45)
output:
True
Ask Python whether the strings "python" and "Python" are equal.
# Comparison of strings
print("python" == "Python")
output:
False
Note that string comparison is case-sensitive. What happens if you compare booleans and integers? Write code to see if True and 1 are equal.
# Compare a boolean with an integer
print(True == 1)
print(True == 2)
output:
True
False
A boolean is a special kind of integer: True corresponds to 1, False corresponds to 0.
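Because booleans behave like the integers 1 and 0, they can be added up, which is a handy way to count how many comparisons are True. The sketch below uses a hypothetical list called temperatures.

```python
# bool is a subclass of int in Python.
print(isinstance(True, int))

# True counts as 1 and False as 0 in arithmetic.
two = True + True
print(two)

# Counting how many values pass a comparison (hypothetical data).
temperatures = [18, 25, 31, 22, 35]
hot_days = sum(t > 30 for t in temperatures)
print(hot_days)
```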
Greater and less than
We also talked about the less than and greater than signs, < and >, in Python. We can combine them with an equals sign to get <= and >=. Note that =< and => are not valid. Here are some examples.
print(3 < 4)
print(3 <= 4)
print("alpha" <= "beta")
output:
True
True
True
Remember that for string comparison, Python determines the relationship based on alphabetical order.
Check if x is greater than or equal to -13.
# Comparison of integers
x = -4 * 3
print(x >= -13)
output:
True
Check if True is greater than False.
# Comparison of booleans
print(True > False)
output:
True
Remember that True is 1 and False is 0 in value.
Boolean Operators
We can produce booleans by performing comparison operations. The next step is combining these booleans. We can use boolean operators for this. The three most common ones are and, or, and not.
and
The and operator works just as we would expect. It takes two booleans and returns True only if both booleans are True.
Case1 | Case2 | Case1 and Case2 |
True | True | True |
True | False | False |
False | True | False |
False | False | False |
print(True and True)
print(True and False)
print(False and True)
print(False and False)
output:
True
False
False
False
Instead of using booleans directly, we can also use the results of comparisons. Suppose we have a variable x, equal to 8. To check if this variable is greater than 5 but less than 15, we can use x > 5 and x < 15.
x = 8
print(x > 5 and x < 15)
output:
True
As we already learned, the first part will evaluate to True. The second part will also evaluate to True. So the result of this expression, True and True, is True. This makes sense, because 8 lies between 5 and 15.
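For plain Python numbers, the same range check can also be written as a chained comparison. This is standard Python syntax; the names with_and and chained are introduced here only for illustration.

```python
x = 8

# Two comparisons combined with and.
with_and = x > 5 and x < 15

# The same check as a chained comparison.
chained = 5 < x < 15

print(with_and, chained)
```

Both forms give the same result; the chained version reads closer to the mathematical notation 5 < x < 15.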
or
The or operator works similarly, but the difference is that only one of the booleans needs to be True.
Case1 | Case2 | Case1 or Case2 |
True | True | True |
True | False | True |
False | True | True |
False | False | False |
print(True or True)
print(True or False)
print(False or True)
print(False or False)
output:
True
True
True
False
Here, too, we can make combinations with variables, like this example that checks if a variable y, which is equal to 3, is less than 5 or above 10.
y = 3
print(y < 5 or y > 10)
output:
True
3 less than 5 is True, 3 greater than 10 is False. The or operation thus returns True.
not
Finally, let's look at the not operator. It simply negates the boolean value we use it on: not True is False, and not False is True. The not operation is typically useful if we're combining different boolean operations and then want to negate that result.
print(not True)
print(not False)
output:
False
True
Nested Boolean operators
Let's take the boolean operators to another level.
Note that not has a higher priority than and and or, so it is executed first.
x = 8
y = 9
print(not(not(x < 3) and not(y < 8 or y > 14)))
output:
False
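x < 3 is False, so not(x < 3) is True. y < 8 or y > 14 is False as well, so its negation is also True. True and True gives True, and the outer not turns that into False. If you continue working like this, simplifying from the inside outward, you'll end up with False.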
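The inside-out simplification can be sketched step by step by storing each intermediate result. The names a, b, inner, no_parens and with_parens are introduced here for illustration only.

```python
x = 8
y = 9

a = x < 3                      # False
b = y < 8 or y > 14            # False or False -> False
inner = (not a) and (not b)    # True and True -> True
result = not inner             # False
print(a, b, inner, result)

# Precedence: not binds tighter than and, which binds tighter than or.
no_parens = not a and not b or a
with_parens = ((not a) and (not b)) or a
print(no_parens == with_parens)
```

Adding explicit parentheses, as in with_parens, does not change the result but often makes a nested expression easier to read.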
Filtering on NumPy arrays
Now, for NumPy arrays, things are different. Returning to the lengths example, we can try to find out which lengths are higher than 21 but lower than 22. The result of lengths > 21 is easily found, and so is the one for lengths < 22.
print(lengths)
print(lengths > 21)
print(lengths < 22)
output:
[21.85 20.97 21.75 24.74 21.44]
[ True False True True True]
[ True True True False True]
Let's now try to combine those with the and operator we just learned.
print(lengths > 21 and lengths < 22)
output:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Oops, Python raises ValueError: The truth value of an array with more than one element is ambiguous. The and operator needs a single True or False on each side, and an array of booleans cannot be reduced to one unambiguously.
NumPy provides "array equivalents" of and, or and not as the functions logical_and, logical_or and logical_not.
To find out which lengths are between 21 and 22, we will use these functions. Again, as we expect from NumPy, the and operation is performed element-wise.
print(np.logical_and(lengths > 21, lengths < 22))
output:
[ True False True False True]
To select only the lengths that are between 21 and 22, we can use the resulting array of booleans in square brackets.
print(lengths[np.logical_and(lengths > 21, lengths < 22)])
output:
[21.85 21.75 21.44]
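As an alternative sketch, boolean NumPy arrays can also be combined with the & operator, which behaves like np.logical_and for boolean arrays. The parentheses are required because & binds more tightly than > and <. The names with_function and with_operator are introduced here for illustration.

```python
import numpy as np

lengths = np.array([21.85, 20.97, 21.75, 24.74, 21.44])

# The function form used above.
with_function = np.logical_and(lengths > 21, lengths < 22)

# The operator form; each comparison needs its own parentheses.
with_operator = (lengths > 21) & (lengths < 22)

print(np.array_equal(with_function, with_operator))
print(lengths[with_operator])
```

Which form to use is mostly a matter of taste: the function names are more explicit, while the operators are shorter.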
Again, NumPy wins when it comes to writing short yet very expressive Python code. How about doing this on pandas DataFrames, the de facto standard for dataset manipulation?
Boolean operators on NumPy arrays
As we saw, comparison operators like < and >= work with NumPy arrays out of the box. Unfortunately, this is not true for the boolean operators and, or, and not.
To apply these operations to NumPy arrays, we need np.logical_and(), np.logical_or() and np.logical_not(). Here are some examples on the house1 and house2 arrays.
Generate boolean arrays that answer the following questions:
- Which areas in house1 are greater than 18.5 or smaller than 10?
# house1 greater than 18.5 or smaller than 10
print(np.logical_or(house1 > 18.5, house1 < 10))
output:
[False True False True]
- Which areas are smaller than 11 in both house1 and house2?
# Both house1 and house2 smaller than 11
print(np.logical_and(house1 < 11, house2 < 11))
output:
[False False False True]
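The third function, np.logical_not, flips every boolean in an array. The sketch below also shows the any() and all() methods mentioned in the earlier error message, which reduce a boolean array to a single True or False. The names small_in_both and not_small_in_both are introduced here for illustration.

```python
import numpy as np

house1 = np.array([18.0, 20.0, 10.75, 9.50])
house2 = np.array([14.0, 24.0, 14.25, 9.0])

small_in_both = np.logical_and(house1 < 11, house2 < 11)

# logical_not negates every element.
not_small_in_both = np.logical_not(small_in_both)
print(not_small_in_both)

# any() is True if at least one element is True,
# all() is True only if every element is True.
print(small_in_both.any())
print(small_in_both.all())
```

A single boolean from any() or all() can be used safely in an if statement, unlike the whole array.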
Filtering on pandas DataFrames
NumPy arrays let us do comparison operations and boolean operations on an element-wise basis. Let's now use this knowledge on a pandas DataFrame. The examples below use the countries.csv file. First, let's import the countries dataset from the CSV file using pandas.
import pandas as pd
countries = pd.read_csv('countries.csv', index_col=0)
print(countries)
output:
       country    capital  population
IND      India  New Delhi  1393409030
MMR    Myanmar     Yangon    54806010
THA   Thailand    Bangkok    69950840
SGP  Singapore  Singapore     5453570
CHN      China    Beijing  1412360000
Suppose you now want to keep the countries for which the population is greater than 100,000,000. There are three steps to this.
- First, we get the population column from countries.
- Next, we perform the comparison on this column and store its result.
- Finally, we use this result to do the appropriate selection on the DataFrame.
Step 1: Get the column
The first step is getting the population column from countries. There are many different ways to do this. What's important here is that we get a pandas Series, not a pandas DataFrame. Let's do this with square brackets, like this.
print(type(countries['population']))
print(countries['population'])
output:
<class 'pandas.core.series.Series'>
IND 1393409030
MMR 54806010
THA 69950840
SGP 5453570
CHN 1412360000
Name: population, dtype: int64
The loc and iloc alternatives below would also work perfectly fine.
print(countries.loc[:, 'population'])
output:
IND 1393409030
MMR 54806010
THA 69950840
SGP 5453570
CHN 1412360000
Name: population, dtype: int64
print(countries.iloc[:, 2])
output:
IND 1393409030
MMR 54806010
THA 69950840
SGP 5453570
CHN 1412360000
Name: population, dtype: int64
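To see why the Series versus DataFrame distinction matters, here is a small sketch. It rebuilds the same five rows in code so that it runs without countries.csv; that construction is an assumption made for the example, not how the article loads the data. Single square brackets return a Series, while double square brackets return a one-column DataFrame.

```python
import pandas as pd

# Rebuild the example data in code so the snippet runs without the CSV file.
countries = pd.DataFrame(
    {
        "country": ["India", "Myanmar", "Thailand", "Singapore", "China"],
        "capital": ["New Delhi", "Yangon", "Bangkok", "Singapore", "Beijing"],
        "population": [1393409030, 54806010, 69950840, 5453570, 1412360000],
    },
    index=["IND", "MMR", "THA", "SGP", "CHN"],
)

as_series = countries["population"]      # single brackets -> Series
as_frame = countries[["population"]]     # double brackets -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)

# The loc and iloc versions select the same Series.
print(as_series.equals(countries.loc[:, "population"]))
print(as_series.equals(countries.iloc[:, 2]))
```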
Step 2: Compare
Next, we perform the comparison. To see which rows have a population greater than 100,000,000, we simply append > 100000000 to the code from before, like this.
print(countries['population'] > 100000000)
output:
IND True
MMR False
THA False
SGP False
CHN True
Name: population, dtype: bool
Now we get a Series containing booleans. If you compare it to the population values, you can see that the rows with a population over 100000000 correspond to True, and the ones with a population under 100000000 correspond to False. Let's store this boolean Series as is_huge.
is_huge = countries['population'] > 100000000
print(is_huge)
output:
IND True
MMR False
THA False
SGP False
CHN True
Name: population, dtype: bool
Step 3: Subset the DataFrame
The final step is using this boolean Series, is_huge, to subset the pandas DataFrame. To do this, we put is_huge inside square brackets.
print(countries[is_huge])
output:
     country    capital  population
IND    India  New Delhi  1393409030
CHN    China    Beijing  1412360000
The result is exactly what we want: only the countries with a population greater than 100000000, namely India and China.
Summary
So let's summarize: we selected the population column, performed a comparison on it, and stored the result as is_huge so that we can use it to index the countries DataFrame. These separate commands do the trick. However, we can also write this in one line: simply put the code that defines is_huge directly in the square brackets.
print(countries[countries['population'] > 100000000])
output:
     country    capital  population
IND    India  New Delhi  1393409030
CHN    China    Beijing  1412360000
Great! pandas makes a data scientist's life much easier.
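A boolean Series can also be passed to loc, which lets us filter rows and pick columns in one step. The sketch below rebuilds the example data in code (an assumption, so that it runs without countries.csv); the name huge_capitals is introduced here for illustration.

```python
import pandas as pd

# Rebuild the example data in code so the snippet runs without the CSV file.
countries = pd.DataFrame(
    {
        "country": ["India", "Myanmar", "Thailand", "Singapore", "China"],
        "capital": ["New Delhi", "Yangon", "Bangkok", "Singapore", "Beijing"],
        "population": [1393409030, 54806010, 69950840, 5453570, 1412360000],
    },
    index=["IND", "MMR", "THA", "SGP", "CHN"],
)

is_huge = countries["population"] > 100_000_000

# Rows where is_huge is True, restricted to two columns.
huge_capitals = countries.loc[is_huge, ["country", "capital"]]
print(huge_capitals)

# Because True counts as 1, sum() counts the matching rows.
print(is_huge.sum())
```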
Boolean operators on Pandas DataFrame
So far we haven't used boolean operators on a DataFrame. Remember that we used the logical_and function from the NumPy package to do an element-wise boolean operation on NumPy arrays? Because pandas is built on NumPy, we can also use that function here. Let's write code that keeps the observations with a population between 10,000,000 and 90,000,000.
print(countries)
output:
       country    capital  population
IND      India  New Delhi  1393409030
MMR    Myanmar     Yangon    54806010
THA   Thailand    Bangkok    69950840
SGP  Singapore  Singapore     5453570
CHN      China    Beijing  1412360000
print(np.logical_and(countries['population'] > 10000000, countries['population'] < 90000000))
output:
IND False
MMR True
THA True
SGP False
CHN False
Name: population, dtype: bool
The only thing left to do is placing this code inside square brackets to subset countries appropriately. This time, only Myanmar and Thailand are included. Look how easy it is to filter DataFrames to get interesting results.
print(countries[np.logical_and(countries['population'] > 10000000, countries['population'] < 90000000)])
output:
      country  capital  population
MMR   Myanmar   Yangon    54806010
THA  Thailand  Bangkok    69950840
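The same filter can be sketched with the & operator, which pandas supports on boolean Series. As with NumPy, each comparison needs its own parentheses. The data is rebuilt in code here (an assumption, so that the snippet runs without countries.csv), and the names population, with_numpy and with_operator are introduced for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example data in code so the snippet runs without the CSV file.
countries = pd.DataFrame(
    {
        "country": ["India", "Myanmar", "Thailand", "Singapore", "China"],
        "capital": ["New Delhi", "Yangon", "Bangkok", "Singapore", "Beijing"],
        "population": [1393409030, 54806010, 69950840, 5453570, 1412360000],
    },
    index=["IND", "MMR", "THA", "SGP", "CHN"],
)

population = countries["population"]

# The NumPy function used above.
with_numpy = np.logical_and(population > 10_000_000, population < 90_000_000)

# The operator form on boolean Series.
with_operator = (population > 10_000_000) & (population < 90_000_000)

print(with_numpy.equals(with_operator))
print(countries[with_operator])
```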
Now we know about the comparison operators <, <=, >, >=, == and !=, and we also know how to combine the boolean results using the boolean operators and, or and not.
Control Flow
Things get interesting when we can use these concepts to change how our program behaves. Depending on the outcome of our comparisons, we might want our Python code to behave differently. We can do this with conditional statements in Python: if, else and elif.
if
Suppose we have a variable x, equal to 4. If the value is even, we want to print out: "x is even."
x = 4
if x % 2 == 0:
    print('x is even.')
output:
x is even.
The modulo operator % with 2 returns 0 if x is even. Python checks if the condition holds. It's true, so the corresponding code is executed: "x is even." gets printed out.
Let's compare this to the general recipe for an if statement. It reads as follows: if the condition is True, execute the indented code.
Notice the colon at the end, and the fact that we indent the Python code with four spaces (or a tab) to tell Python what to do when the condition holds. To exit the if statement, simply continue with some Python code without indentation, and Python will know that it's not part of the if statement. It's perfectly possible to have more lines inside the if statement, like this for example.
x = 4
if x % 2 == 0:
    print('Checking if x (', x, ') is divisible by 2...')
    print('x is even.')
output:
Checking if x ( 4 ) is divisible by 2...
x is even.
The script now prints out two lines if we run it. If the condition does not hold, the indented code is not executed. You can see this if we change x to 3 and rerun the code.
x = 3
if x % 2 == 0:
    print('Checking if x (', x, ') is divisible by 2...')
    print('x is even.')
output:
There's no output. Suppose now that we want to print out "x is odd." in this case. How do we do this?
else
Well, we can simply use an else statement, like this.
x = 3
if x % 2 == 0:
    print('x is even.')
else:
    print('x is odd.')
output:
x is odd.
If we run it with x equal to 3, the condition is not true, so the code under the else statement is executed and "x is odd." gets printed out. In the general recipe, the else statement doesn't need a condition. The code under else runs whenever the condition of the if statement does not hold.
elif
We can think of cases where even more customized behavior is necessary. Say we want different printouts for numbers that are divisible by 2 and for numbers that are divisible by 3. We can use elif to get the job done. Here is an example.
x = 3
if x % 2 == 0: # False
    print('x is divisible by 2.')
elif x % 3 == 0: # True
    print('x is divisible by 3.')
else:
    print('x is divisible by neither 2 nor 3.')
output:
x is divisible by 3.
If x equals 3, the first condition is False, so Python goes on to check the next condition. This condition is True, so the corresponding print statement is executed.
Suppose now that x equals 6. Both the if and elif conditions are True in this case. Will two printouts occur?
x = 6
if x % 2 == 0: # True
    print('x is divisible by 2.')
elif x % 3 == 0: # never reached
    print('x is divisible by 3.')
else:
    print('x is divisible by neither 2 nor 3.')
output:
x is divisible by 2.
Nope. As soon as Python finds a true condition, it executes the corresponding code and then leaves the whole control structure. This means the second condition, which belongs to the elif, is never reached, so there's no corresponding printout. Control flow can be extremely powerful when we're writing Python scripts.
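To contrast the two behaviors, here is a small sketch with two hypothetical helper functions, first_match and all_matches. The first uses if/elif/else, so only the first true branch runs. The second uses independent if statements, so every true condition is reported.

```python
def first_match(x):
    """if/elif/else: only the first true branch runs."""
    if x % 2 == 0:
        return "divisible by 2"
    elif x % 3 == 0:
        return "divisible by 3"
    else:
        return "divisible by neither 2 nor 3"


def all_matches(x):
    """Independent if statements: every true condition is collected."""
    found = []
    if x % 2 == 0:
        found.append("divisible by 2")
    if x % 3 == 0:
        found.append("divisible by 3")
    return found


for value in (3, 6, 7):
    print(value, first_match(value), all_matches(value))
```

For 6, first_match reports only divisibility by 2, while all_matches reports both. Which structure fits depends on whether the cases are meant to be mutually exclusive.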
Conclusion
In this article, we learned logical comparison, control flow, and filtering on Numpy Array and Pandas DataFrame.