The code we’ll be working with in this piece is a set of Python functions that use Pandas to read in and process data. It includes a function to read the raw data in chunks, then a few functions that perform transformations on the raw data.
# data_processing.py
import pandas as pd
from pandas import DataFrame


def read_raw_data(file_path: str, chunk_size: int = 1000) -> DataFrame:
    csv_reader = pd.read_csv(file_path, chunksize=chunk_size)
    processed_chunks = []

    for chunk in csv_reader:
        # drop repeated header rows and rows with missing values
        chunk = chunk.loc[chunk["Order ID"] != "Order ID"].dropna()
        processed_chunks.append(chunk)

    return pd.concat(processed_chunks, axis=0)


def split_purchase_address(df_to_process: DataFrame) -> DataFrame:
    df_address_split = df_to_process["Purchase Address"].str.split(
        ",", n=3, expand=True
    )
    df_address_split.columns = ["Street Name", "City", "State and Postal Code"]

    df_state_postal_split = (
        df_address_split["State and Postal Code"]
        .str.strip()
        .str.split(" ", n=2, expand=True)
    )
    df_state_postal_split.columns = ["State Code", "Postal Code"]

    return pd.concat([df_to_process, df_address_split, df_state_postal_split], axis=1)


def extract_product_pack_information(df_to_process: DataFrame) -> DataFrame:
    df_to_process["Pack Information"] = (
        df_to_process["Product"].str.extract(r".*\((.*)\).*").fillna("Not Pack")
    )
    return df_to_process


def one_hot_encode_product_column(df_to_process: DataFrame) -> DataFrame:
    return pd.get_dummies(df_to_process, columns=["Product"])


def process_raw_data(file_path: str, chunk_size: int) -> DataFrame:
    df = read_raw_data(file_path=file_path, chunk_size=chunk_size)

    return (
        df.pipe(split_purchase_address)
        .pipe(extract_product_pack_information)
        .pipe(one_hot_encode_product_column)
    )
Next, we can get started with implementing our first data validation test. If you’re going to follow along in a notebook or IDE, you should import the following in a new file (or in another cell in your notebook):
import pandas as pd
import numpy as np
import pytest
from pandas import DataFrame
from data_processing import (
read_raw_data,
split_purchase_address,
extract_product_pack_information,
one_hot_encode_product_column,
)
from pandas.testing import assert_series_equal, assert_index_equal
You can read more on how to actually run pytest (naming conventions for files and how tests are discovered) here, but for our case, all you need to do is create a new file called test_data_processing.py, and as you add to the file in your IDE you can simply run pytest, optionally with the --verbose flag.
Quick Introduction to pytest and a Simple Data Validation Test
Pytest is a testing framework in Python that makes it easy for you to write tests for your data pipelines. You can primarily make use of the assert statement, which checks whether a condition you place after assert evaluates to True or False. If it evaluates to False, it raises an AssertionError (and when used with pytest, causes the test to fail).
So first, let’s test something simple. All we’re going to do is check whether the output of one of our functions (the first one, which reads the raw data) returns a DataFrame.
As a quick aside, you’ll notice in the original function we use the arrow -> syntax to add a type hint saying the function should return a DataFrame. This means that if you write your function to return something other than a DataFrame, your IDE will flag it as returning an invalid output (but this won’t technically break your code or prevent it from running).
To actually check whether the function returns a DataFrame, we’ll implement a function to test the read_raw_data function and just call it test_read_raw_data.
def test_read_raw_data():
    """Testing output of raw table read in is DataFrame"""
    test_df = read_raw_data(file_path="Updated_sales.csv", chunk_size=1000)
    assert isinstance(test_df, DataFrame)  # checking if it's a DataFrame
In this function, we add a one-line docstring to explain that our test function is just checking whether the output is a DataFrame. Then, we assign the output of the existing read_raw_data function to a variable and use isinstance, which returns True or False depending on whether the given object is of the type you pass in. In this case, we check whether test_df is a DataFrame.
We can similarly do this for the rest of our functions that just take a DataFrame as input and are expected to return a DataFrame as output. Implementing it can look like this:
def test_pipe_functions_output_df():
    """Testing outputs of pipe functions are DataFrames"""
    test_df = read_raw_data(file_path="Updated_sales.csv", chunk_size=1000)
    all_pipe_functions = [
        split_purchase_address,
        extract_product_pack_information,
        one_hot_encode_product_column,
    ]
    for function in all_pipe_functions:
        assert isinstance(function(test_df), DataFrame)
Note that you can also use the assert statement in a for loop, so we just go through each of the functions, passing in a DataFrame as input and checking whether the output is also a DataFrame.
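If you’d rather have one reported result per function instead of a single looped test, pytest’s parametrize marker can generate a separate test case for each pipe function. A minimal sketch, with stub functions standing in for the real ones from data_processing.py so it runs on its own:

```python
import pandas as pd
import pytest
from pandas import DataFrame

# Stubs standing in for the real pipe functions (assumption: each takes
# and returns a DataFrame, which is all this type check needs).
def split_purchase_address(df: DataFrame) -> DataFrame:
    return df.copy()

def extract_product_pack_information(df: DataFrame) -> DataFrame:
    return df.copy()

def one_hot_encode_product_column(df: DataFrame) -> DataFrame:
    return df.copy()

@pytest.mark.parametrize(
    "pipe_function",
    [split_purchase_address, extract_product_pack_information, one_hot_encode_product_column],
)
def test_pipe_function_outputs_df(pipe_function):
    test_df = pd.DataFrame({"Order ID": ["141234", "141235"]})
    # each parametrized case passes or fails independently in the pytest report
    assert isinstance(pipe_function(test_df), DataFrame)
```

The looped version fails fast at the first bad function; the parametrized version tells you about every failing function in one run.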
Implementing fixtures in pytest for more efficient testing
You can see above that we had to write the exact same line twice in our two different test functions:
test_df = read_raw_data(file_path="Updated_sales.csv", chunk_size=1000)
This is because both test functions needed a DataFrame as input to check whether the output of our data processing functions is also a DataFrame. To avoid copying the same code into all your test functions, you can use fixtures, which let you write some code that pytest will let you reuse across your different tests. Doing so looks like this:
@pytest.fixture
def test_df() -> DataFrame:
    return read_raw_data(file_path="Updated_sales.csv", chunk_size=1000)


def test_read_raw_data(test_df):
    """Testing output of raw table read in is DataFrame"""
    assert isinstance(test_df, DataFrame)  # checking if it's a DataFrame


def test_pipe_functions_output_df(test_df):
    """Testing outputs of pipe functions are DataFrames"""
    all_pipe_functions = [
        split_purchase_address,
        extract_product_pack_information,
        one_hot_encode_product_column,
    ]
    for function in all_pipe_functions:
        assert isinstance(function(test_df), DataFrame)
We define test_df in a function this time that returns the raw DataFrame. Then, in our test functions, we just include test_df as a parameter and can use it exactly as we did before.
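One caveat worth knowing: by default a fixture runs once per test function, so the CSV above is still read for every test that uses it. If that read is slow, pytest’s scope argument can cache the result, for example once per test file with scope="module". A sketch under that assumption, with a small inline frame standing in for the real read_raw_data call so it is self-contained:

```python
import pandas as pd
import pytest
from pandas import DataFrame

def load_test_df() -> DataFrame:
    # Stand-in for read_raw_data(file_path="Updated_sales.csv", chunk_size=1000);
    # hypothetical rows so the sketch runs without the CSV.
    return pd.DataFrame(
        {"Order ID": ["141234", "141235"], "Product": ["iPhone", "Monitor"]}
    )

# scope="module" makes pytest build this DataFrame once per test file,
# so the (potentially slow) raw read happens a single time.
test_df = pytest.fixture(scope="module")(load_test_df)
```

The trade-off is that module-scoped fixtures are shared between tests, so any test that mutates the DataFrame in place can affect the tests that run after it.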
Next, let’s get into checking our split_purchase_address function, which essentially outputs the same DataFrame passed as input but with additional address columns. Our test function will look like this:
def test_split_purchase_address(test_df):
    """Testing multiple columns in output and rows unchanged"""
    split_purchase_address_df = split_purchase_address(test_df)
    assert len(split_purchase_address_df.columns) > len(test_df.columns)
    assert split_purchase_address_df.index.__len__() == test_df.index.__len__()
    assert_index_equal(split_purchase_address_df.index, test_df.index)  # using the Pandas testing functions
Here, we’ll check two main things:
- Does the output DataFrame have more columns than the original DataFrame?
- Does the output DataFrame have the same index as the original DataFrame?
First, we run the split_purchase_address function, passing the test_df as input and assigning the result to a new variable. This gives us the output of the original function that we can then test.
To actually do the test, we could check whether a specific column exists in the output DataFrame, but a simpler (not necessarily better) way of doing it is just checking whether the output DataFrame has more columns than the original with the assert statement. Similarly, we can assert that the length of the index is the same for both DataFrames.
You can also check the Pandas testing documentation for some built-in testing functions, though there are just a few, which essentially check whether two DataFrames, indexes, or Series are equal. We use the assert_index_equal function to do the same thing we did with index.__len__().
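To see how these helpers behave on their own, here is a small sketch with made-up frames. Note that they return None on success rather than True, and raise an AssertionError with a detailed diff on mismatch, which is why you call them directly instead of wrapping them in an assert:

```python
import pandas as pd
from pandas.testing import assert_frame_equal, assert_index_equal

# Two hypothetical frames that are identical by construction.
df_a = pd.DataFrame({"x": [1, 2, 3]})
df_b = df_a.copy()

# both pass silently (returning None, not True)
assert_index_equal(df_a.index, df_b.index)
assert_frame_equal(df_a, df_b)

try:
    # index lengths 3 vs 5, so this raises an AssertionError describing the difference
    assert_index_equal(df_a.index, pd.RangeIndex(5))
except AssertionError:
    print("indexes differ")
```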
As mentioned before, we can also check whether a DataFrame contains a specific column. We’ll move on to the next function, extract_product_pack_information, which should always output the original DataFrame with an additional column called “Pack Information”. Our test function will look like this:
def test_extract_product_pack_information(test_df):
    """Testing specific output column in new DataFrame"""
    product_pack_df = extract_product_pack_information(test_df)
    assert "Pack Information" in product_pack_df.columns
Here, all we do is call columns again on the output of the original function, but this time check specifically whether the “Pack Information” column is in the list of columns. If for some reason we edited our original extract_product_pack_information function to return additional columns or renamed the output column, this test would fail. This would be a good reminder to check whether whatever we used the final data for (like a machine learning model) also took that into account.
We could then do one of two things:
- Make changes downstream in our code pipeline (like code that refers to the “Pack Information” column);
- Edit our tests to reflect the changes in our processing function.
Another thing we should be doing is checking whether the DataFrame returned by our functions has columns of our desired data types. For example, if we’re doing calculations on numerical columns, we should see whether the columns are returned as an int or float, depending on what we need.
Let’s test data types on our one_hot_encode_product_column function, where we do a typical feature engineering step on one of the categorical columns in the original DataFrame. We expect all the new columns to be of the uint8 dtype (what the get_dummies function in Pandas returns by default), so we can test that like this.
def test_one_hot_encode_product_column(test_df):
    """Testing if column types are correct"""
    encoded_df = one_hot_encode_product_column(test_df)
    encoded_columns = [column for column in encoded_df.columns if "_" in column]
    for encoded_column in encoded_columns:
        assert encoded_df[encoded_column].dtype == np.dtype("uint8")
The get_dummies function returns columns that contain an underscore, which is how we pick out the encoded columns here (this, of course, could be done better by checking the actual column names, like in the previous test function where we check for a specific column).
Then, all we’re doing is looping over those target columns and checking that each of them is of the np.dtype("uint8") data type. I checked this beforehand by just inspecting the data type of one of the output columns in a notebook, like column.dtype.
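One caveat: in pandas 2.0 and later, get_dummies returns bool columns by default instead of uint8, so the uint8 assertion would fail on newer versions. A version-tolerant sketch of the same check, using a small made-up frame and matching the encoded columns by their "Product_" prefix rather than a bare underscore:

```python
import numpy as np
import pandas as pd

# Hypothetical input with a categorical Product column.
df = pd.DataFrame({"Product": ["iPhone", "Monitor", "iPhone"]})
encoded_df = pd.get_dummies(df, columns=["Product"])

# target only the columns that get_dummies created, via their prefix
encoded_columns = [c for c in encoded_df.columns if c.startswith("Product_")]
for column in encoded_columns:
    # uint8 on older pandas, bool on pandas >= 2.0: accept either
    assert encoded_df[column].dtype in (np.dtype("uint8"), np.dtype("bool"))
```

Alternatively, passing dtype explicitly (pd.get_dummies(df, columns=["Product"], dtype="uint8")) pins the output type so the test doesn’t depend on the pandas version at all.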
Another good practice, in addition to testing the individual functions that make up your data processing and transformation pipelines, is testing the final output of your pipeline. To do so, we’ll simulate running our entire pipeline in the test, and then check the resulting DataFrame.
def test_process_raw_data(test_df):
    """Testing the final output DataFrame as a final sanity check"""
    processed_df = (
        test_df.pipe(split_purchase_address)
        .pipe(extract_product_pack_information)
        .pipe(one_hot_encode_product_column)
    )

    # check if all original columns are still in DataFrame
    for column in test_df.columns:
        if column not in processed_df.columns:
            raise AssertionError(f"COLUMN -- {column} -- not in final DataFrame")
    assert all(
        element in list(processed_df.columns) for element in list(test_df.columns)
    )

    # check if final DataFrame doesn't have duplicates
    assert_series_equal(
        processed_df["Order ID"].drop_duplicates(), test_df["Order ID"]
    )
Our final test_process_raw_data will check for two last things:
- Checking if the original columns are still present in the final DataFrame. This isn’t always a requirement, but it might be that you want all the raw data to still be available (and not transformed) in your output. Doing so is simple: we just need to check whether each column in the test_df is still present in the processed_df. This time, we raise an AssertionError (similar to just using an assert statement) if a column isn’t present, which is a good example of how you can output a specific message in your tests when needed.
- Checking if the final DataFrame doesn’t have any duplicates. There are a lot of different ways you can do this; in this case, we’re just using the “Order ID” (which we expect to act like an index) and assert_series_equal to see whether the output DataFrame didn’t generate any duplicate rows.
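A more direct alternative for the duplicate check (not what the test above uses, just another option) is Series.duplicated, which flags repeated values and makes it easy to report which keys repeat. A sketch with a hypothetical processed frame that does contain a duplicate:

```python
import pandas as pd

# Hypothetical pipeline output where one Order ID appears twice.
processed_df = pd.DataFrame({"Order ID": ["141234", "141235", "141235"]})

duplicated = processed_df["Order ID"].duplicated()
print(duplicated.any())                                   # -> True
print(processed_df.loc[duplicated, "Order ID"].tolist())  # -> ['141235']

# in a real test you would assert the opposite:
# assert not processed_df["Order ID"].duplicated().any()
```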
Checking the pytest output
For a quick look at what running pytest looks like, in your IDE just run:
pytest --verbose
Pytest will pick up the new test file with all the test functions and run them! This is a simple implementation of having a series of data validation and testing checks on your data processing pipeline. If you run the above, the output should look something like this:
You can see that our final test failed, specifically the part where we check whether all the columns from the initial DataFrame are present in the final one. You can also see that the custom error message in the AssertionError we defined earlier is populating correctly: the “Product” column from our original DataFrame isn’t showing up in the final DataFrame (see if you can figure out why based on our initial data processing functions).
There’s a lot more room to improve on this testing; we just have a really simple implementation with basic testing and data validation cases. For more complex pipelines, you may want a lot more testing, both for your individual data processing functions and on your raw and final output DataFrames, to ensure that the data you end up using is data you can trust.