Converting Between Parquet and CSV Files


In this post, I'll show you how to convert Parquet files to CSV and back again. I wrote this as a note for myself, but I hope it helps others too.

Why Parquet?

Before we dive in, you might wonder, "Why Parquet?" Parquet is a columnar storage format optimized for analytics, and it is widely used with big data tools like Apache Spark and Apache Hive. Because values from the same column are stored together, Parquet files usually compress to a smaller size than the equivalent CSV, and they read much faster when you only need specific columns, since the reader can skip the rest.

Getting Started

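First things first, we need to install the necessary Python libraries. pandas handles the dataframes, and pyarrow provides the Parquet engine that pandas uses for reading and writing. Run this in your terminal: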

pip install pandas pyarrow

Generating a Sample Parquet File

Let's kick things off by generating a sample dataframe and saving it as a Parquet file.

import pandas as pd

def generate_sample_parquet(filename='sample.parquet'):
    # Sample data
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
    }

    df = pd.DataFrame(data)

    # Save as Parquet
    df.to_parquet(filename, index=False)

    print(f"Sample Parquet file saved at: {filename}")

generate_sample_parquet()

Convert Parquet to CSV

Now that we have our Parquet file, let's convert it to CSV.

import pandas as pd

def parquet_to_csv(parquet_path, csv_path='output.csv'):
    # Read Parquet
    df = pd.read_parquet(parquet_path)

    # Save as CSV
    df.to_csv(csv_path, index=False)
    print(f"Data saved to: {csv_path}")

parquet_to_csv('sample.parquet', 'sample.csv')

Convert CSV to Parquet

You might also find situations where you need to go the other way around, converting CSVs back to Parquet. Here's how you can do that:

import pandas as pd

def csv_to_parquet(csv_path, parquet_path='converted.parquet'):
    # Read CSV
    df = pd.read_csv(csv_path)

    # Save as Parquet
    df.to_parquet(parquet_path, index=False)
    print(f"Data saved to: {parquet_path}")

csv_to_parquet('sample.csv', 'reconverted.parquet')
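
One thing to watch when going from CSV to Parquet: a CSV file carries no type information, so pandas has to infer column types when reading it. Numbers are usually inferred correctly, but dates come back as plain text unless you ask for them to be parsed. The sketch below illustrates this with a hypothetical `Joined` column that is not part of the sample data above.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Joined": pd.to_datetime(["2023-01-15", "2024-06-01"]),  # hypothetical column
})

# With no path given, to_csv returns the CSV content as a string.
csv_text = df.to_csv(index=False)

# Without hints, the date column is read back as plain text.
plain = pd.read_csv(io.StringIO(csv_text))

# parse_dates tells pandas which columns to convert to datetimes.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Joined"])

print(plain["Joined"].dtype)
print(parsed["Joined"].dtype)
```

Saving `parsed` with `to_parquet` would then store `Joined` as a timestamp column rather than as strings.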

Reading Parquet Files with pandas

Reading Parquet files directly using pandas is super easy. Here's a quick snippet to help you get started:

import pandas as pd

fname = "reconverted.parquet"
df = pd.read_parquet(fname)
print(df.head())

Handling Parquet and CSV files in Python is incredibly straightforward, thanks to libraries like pandas and pyarrow. Whether you're diving into big data analytics or just exploring different file formats, I hope this guide proves useful to you.
