Other Common Tasks
Split Data into Training and Test Datasets
train, test = dataset.randomSplit([0.75, 0.25], seed=1337)

Rename All Columns
column_list = data.columns
prefix = "my_prefix"
new_column_list = [prefix + s for s in column_list]
#new_column_list = [prefix + s if s != "ID" else s for s in column_list] ## Use if you plan on joining on an ID later
column_mapping = [[o, n] for o, n in zip(column_list, new_column_list)]
# print(column_mapping)
data = data.select([col(old).alias(new) for old, new in column_mapping])

Convert PySpark DataFrame to NumPy Array
## Convert `train` DataFrame to NumPy
pdtrain = train.toPandas()
trainseries = pdtrain['features'].apply(lambda x: np.array(x.toArray())).to_numpy().reshape(-1, 1)  ## .as_matrix() was removed in pandas 1.0; use .to_numpy()
X_train = np.apply_along_axis(lambda x : x[0], 1, trainseries)
y_train = pdtrain['label'].values.reshape(-1,1).ravel()
## Convert `test` DataFrame to NumPy
pdtest = test.toPandas()
testseries = pdtest['features'].apply(lambda x: np.array(x.toArray())).to_numpy().reshape(-1, 1)  ## .as_matrix() was removed in pandas 1.0; use .to_numpy()
X_test = np.apply_along_axis(lambda x : x[0], 1, testseries)
y_test = pdtest['label'].values.reshape(-1,1).ravel()
print(y_test)

Call Cognitive Service API using PySpark
1. Create `chunker` function
2. Convert Spark DataFrame to Pandas
3. Set up API requirements
4. Create DataFrame for incoming scored data
5. Loop through chunks of the data and call the API
6. Write the results out to mounted storage
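The loop in steps 5–6 relies on the `chunker` helper from step 1 to batch rows so each API call stays under the service's request-size limit. A minimal sketch (the chunk size and row data are illustrative):

```python
def chunker(seq, size):
    # Yield successive slices of `seq` with at most `size` elements each.
    # Works on anything sliceable: lists, tuples, pandas DataFrames via .iloc, etc.
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

# Example: batch 10 rows into chunks of 4 before posting each chunk to the API.
rows = list(range(10))
batches = [list(chunk) for chunk in chunker(rows, 4)]
```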
Find All Columns of a Certain Type
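In PySpark, `df.dtypes` returns a list of `(column_name, type_string)` pairs, so filtering by type is a one-line comprehension. A sketch using a hard-coded pairs list in place of a live DataFrame (the helper name and columns are illustrative):

```python
def columns_of_type(dtypes, wanted):
    # `dtypes` mirrors PySpark's df.dtypes: a list of (name, type_string) pairs.
    return [name for name, dtype in dtypes if dtype == wanted]

# With a real DataFrame this would be: columns_of_type(df.dtypes, "string")
example_dtypes = [("id", "bigint"), ("name", "string"), ("city", "string")]
string_cols = columns_of_type(example_dtypes, "string")
```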
Change a Column's Type
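The direct route for one column is `df.withColumn("age", col("age").cast("int"))`. When several columns need casting at once, `df.selectExpr(...)` with generated `CAST` expressions avoids chained `withColumn` calls. A sketch of the expression builder (the helper name and column names are illustrative):

```python
def cast_select_exprs(all_columns, casts):
    # Build expressions for df.selectExpr(): cast the columns named in `casts`
    # ({column: spark_sql_type}), pass the rest through unchanged.
    return [f"CAST(`{c}` AS {casts[c]}) AS `{c}`" if c in casts else f"`{c}`"
            for c in all_columns]

# With a real DataFrame: df = df.selectExpr(*cast_select_exprs(df.columns, {"age": "INT"}))
exprs = cast_select_exprs(["id", "age"], {"age": "INT"})
```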
Generate StructType Schema Printout (Manual Execution)
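A "printout" here means emitting copy-pasteable `StructField(...)` source lines from an existing DataFrame's `dtypes`, for manual editing before execution. A sketch covering a few common type mappings (`TYPE_MAP` is partial and illustrative; extend it for your columns):

```python
# Map PySpark dtype strings to pyspark.sql.types class names (partial; extend as needed).
TYPE_MAP = {"string": "StringType", "bigint": "LongType",
            "int": "IntegerType", "double": "DoubleType",
            "timestamp": "TimestampType"}

def schema_printout(dtypes):
    # `dtypes` mirrors df.dtypes; returns StructField lines ready to paste
    # into a pyspark.sql.types.StructType([...]) definition.
    lines = [f'    StructField("{name}", {TYPE_MAP[t]}(), True),' for name, t in dtypes]
    return "StructType([\n" + "\n".join(lines) + "\n])"

printout = schema_printout([("id", "bigint"), ("name", "string")])
print(printout)
```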
Generate StructType Schema from List (Automatic Execution)
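For the automatic version, the schema can be built programmatically instead of pasted: Spark's `createDataFrame` and the `spark.read` APIs also accept a DDL-formatted string (`"name TYPE, ..."`), which sidesteps constructing `StructType` objects by hand. A sketch of the DDL builder (the helper name and columns are illustrative):

```python
def ddl_schema(pairs):
    # pairs: list of (column_name, spark_sql_type) tuples.
    # Returns a DDL string Spark accepts wherever a schema is expected,
    # e.g. spark.createDataFrame(rows, schema=ddl_schema(pairs)).
    return ", ".join(f"{name} {sql_type}" for name, sql_type in pairs)

schema = ddl_schema([("id", "BIGINT"), ("name", "STRING"), ("score", "DOUBLE")])
```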
Make a DataFrame of Consecutive Dates
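The date range itself can be generated in plain Python with `datetime`; turning it into a DataFrame then only needs an active SparkSession (shown commented, since it assumes a live `spark` object). On Spark 2.4+, the SQL `sequence()` function is an alternative.

```python
from datetime import date, timedelta

def consecutive_dates(start, n):
    # Return `n` consecutive ISO-format date strings beginning at `start`.
    return [(start + timedelta(days=i)).isoformat() for i in range(n)]

dates = consecutive_dates(date(2020, 1, 1), 3)
# With a live session:
# df = spark.createDataFrame([(d,) for d in dates], ["date"])
```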
Unpivot a DataFrame Dynamically (Longer)
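PySpark has no built-in `melt` until 3.4's `DataFrame.melt`, so the usual dynamic unpivot builds a SQL `stack()` expression from the column list at runtime. A sketch of the expression builder (the helper and column names are illustrative):

```python
def stack_expr(value_cols, key_name="key", value_name="value"):
    # Build the expression used with df.selectExpr(id_col, stack_expr(cols)):
    #   stack(n, 'c1', `c1`, 'c2', `c2`, ...) as (key, value)
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_cols)
    return f"stack({len(value_cols)}, {pairs}) as ({key_name}, {value_name})"

# With a real DataFrame: long_df = df.selectExpr("id", stack_expr(["q1", "q2"]))
expr = stack_expr(["q1", "q2"])
```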