In this post, I'll show you how to convert Parquet files to CSV and back again. I wrote this as a note for myself, but I hope it helps others too.
Why Parquet?
Before we dive in, you might wonder, "Why Parquet?" Parquet is a columnar storage format optimized for analytics. It is widely used in big data processing tools like Apache Spark and Apache Hive. Because it stores and compresses data column by column, a Parquet file is usually smaller than the equivalent CSV, and a reader that needs only a few columns can skip the rest instead of parsing every row.
Getting Started
First things first, we need to install the necessary Python libraries. Run this in your terminal:
pip install pandas pyarrow
Generating a Sample Parquet File
Let's kick things off by generating a sample dataframe and saving it as a Parquet file.
import pandas as pd

def generate_sample_parquet(filename='sample.parquet'):
    # Sample data
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
    }
    df = pd.DataFrame(data)
    # Save as Parquet
    df.to_parquet(filename, index=False)
    print(f"Sample Parquet file saved at: {filename}")

generate_sample_parquet()
Convert Parquet to CSV
Now that we have our Parquet file, let's convert it to CSV.
import pandas as pd

def parquet_to_csv(parquet_path, csv_path='output.csv'):
    # Read Parquet
    df = pd.read_parquet(parquet_path)
    # Save as CSV
    df.to_csv(csv_path, index=False)
    print(f"Data saved to: {csv_path}")

parquet_to_csv('sample.parquet', 'sample.csv')
Convert CSV to Parquet
You might also find situations where you need to go the other way around, converting CSVs back to Parquet. Here's how you can do that:
import pandas as pd

def csv_to_parquet(csv_path, parquet_path='converted.parquet'):
    # Read CSV
    df = pd.read_csv(csv_path)
    # Save as Parquet
    df.to_parquet(parquet_path, index=False)
    print(f"Data saved to: {parquet_path}")

csv_to_parquet('sample.csv', 'reconverted.parquet')
Reading Parquet Files with pandas
Reading Parquet files directly using pandas is super easy. Here's a quick snippet to help you get started:
fname = "reconverted.parquet"
df = pd.read_parquet(fname)
print(df.head())
Handling Parquet and CSV files in Python is incredibly straightforward, thanks to libraries like pandas and pyarrow. Whether you're diving into big data analytics or just exploring different file formats, I hope this guide proves useful to you.