In this section, we provide guides and references to use the BigQuery connector.
Step 1: Create a New Service
Step 2: Select the BigQuery Service Type
- Select BigQuery as the Service type and click NEXT.

Step 3: Name and Describe Your Service
- Provide a name and description for your Service.
Service Name:
Prakash uniquely identifies Services by their Service Name. Provide a name that distinguishes your deployment from other Services, including the other BigQuery Services that you might be ingesting metadata from.
Note that once the name is set, it cannot be changed.

Step 4: Configure the Service Connection
- In this step, we will configure the connection settings required for BigQuery.
- Please follow the instructions below to properly configure the Service to read from your sources. You will also find helper documentation on the right-hand side panel in the UI.

Connection Details:
hostPort: BigQuery APIs URL.
credentials: You can authenticate with your BigQuery instance using either GCP Credentials Path, where you specify the file path of the service account key, or GCP Credentials Values, where you pass the values directly from the service account key file.
gcpConfig:
- Passing the raw credential values provided by BigQuery. This requires the following information, all of which can be found in the service account key file:
- type: The type of the account; for a service account, the value of this field is service_account. To fetch this key, look for the value associated with the type key in the service account key file.
- projectId: A project ID is a unique string used to differentiate your project from all others in Google Cloud. To fetch this key, look for the value associated with the project_id key in the service account key file. You can also pass multiple project IDs to ingest metadata from different BigQuery projects into one Service.
- privateKeyId: This is a unique identifier for the private key associated with the service account. To fetch this key, look for the value associated with the private_key_id key in the service account key file.
- privateKey: This is the private key associated with the service account that is used to authenticate and authorize access to BigQuery. To fetch this key, look for the value associated with the private_key key in the service account key file.
- clientEmail: This is the email address associated with the service account. To fetch this key, look for the value associated with the client_email key in the service account key file.
- clientId: This is a unique identifier for the service account. To fetch this key, look for the value associated with the client_id key in the service account key file.
- authUri: This is the URI for the authorization server. To fetch this key, look for the value associated with the auth_uri key in the service account key file.
- tokenUri: The Google Cloud Token URI is a specific endpoint used to obtain an OAuth 2.0 access token from the Google Cloud IAM service. This token allows you to authenticate and access various Google Cloud resources and APIs that require authorization. To fetch this key, look for the value associated with the token_uri key in the service account key file.
- authProviderX509CertUrl: This is the URL of the certificate that verifies the authenticity of the authorization server. To fetch this key, look for the value associated with the auth_provider_x509_cert_url key in the service account key file.
- clientX509CertUrl: This is the URL of the certificate that verifies the authenticity of the service account. To fetch this key, look for the value associated with the client_x509_cert_url key in the service account key file.
- Taxonomy Project ID (Optional): BigQuery uses taxonomies to create hierarchical groups of policy tags. To apply access controls to BigQuery columns, tag the columns with policy tags. Use this field to specify the ID of the project in which your taxonomies were created.
- Taxonomy Location (Optional): BigQuery uses taxonomies to create hierarchical groups of policy tags. To apply access controls to BigQuery columns, tag the columns with policy tags. Use this field to specify the location in which your taxonomies were created.
- Usage Location (Optional): Location used to query INFORMATION_SCHEMA.JOBS_BY_PROJECT to fetch usage data. You can pass multi-regions, such as us or eu, or your specific region such as us-east1. Australia and Asia multi-regions are not yet supported. If you want to verify the key and location outside the UI, see the sketch after this list.
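If you want to sanity-check the service account key and the Usage Location before running the Test Connection step, the sketch below is one way to do it. It is only an illustration, not part of the Prakash connector: it assumes the google-cloud-bigquery and google-auth packages are installed, and the key file path /path/to/service_account.json is a hypothetical placeholder.

    # Sketch: verify a service account key and a usage location outside the UI.
    # /path/to/service_account.json is a hypothetical placeholder for your key file.
    from google.cloud import bigquery
    from google.oauth2 import service_account

    KEY_PATH = "/path/to/service_account.json"
    USAGE_LOCATION = "us"  # multi-region ("us", "eu") or a region such as "us-east1"

    credentials = service_account.Credentials.from_service_account_file(KEY_PATH)
    client = bigquery.Client(credentials=credentials, project=credentials.project_id)

    # Listing datasets confirms the key can reach the BigQuery APIs.
    for dataset in client.list_datasets():
        print("dataset:", dataset.dataset_id)

    # Usage ingestion reads INFORMATION_SCHEMA.JOBS_BY_PROJECT in the chosen location.
    query = f"""
        SELECT job_id, user_email, creation_time
        FROM `region-{USAGE_LOCATION}`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
        ORDER BY creation_time DESC
        LIMIT 5
    """
    for row in client.query(query).result():
        print(row.job_id, row.user_email, row.creation_time)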
Step 5: Test the Connection
Once the credentials have been added, click on TEST CONNECTION to check whether the credentials are valid.

If the test connection is successful, click on SAVE and then configure the Metadata Ingestion.
Step 6: Configure Metadata Ingestion
In this step, we will configure the metadata ingestion pipeline. Please follow the instructions below.
(BigQuery Metadata Image)
Step 7: Schedule the Ingestion and Deploy
- Scheduling can be set up at an hourly, daily, weekly, or manual cadence. The time zone is UTC. Select a Start Date to schedule the ingestion. It is optional to add an End Date.
- Review your configuration settings. If they match what you intended, click DEPLOY to create the service and schedule metadata ingestion.
- If something doesn’t look right, click the BACK button to return to the appropriate step and change the settings as needed.
- After configuring the workflow, you can click on DEPLOY to create the pipeline.
(Metadata Ingestion Pipeline Image)
Step 8: Add Ingestion Pipeline
- After setting the schedule interval, click on ADD INGESTION to add the Metadata Ingestion Pipeline.
(Image)
Step 9: View the Ingestion Pipeline
Once the workflow has been successfully deployed, you can view the Ingestion Pipeline running from the Service Page.
(Ingestion Pipeline View Image)
Step 10: Add Profiler Ingestion Pipeline
- Click on ADD INGESTION to add a Profiler Ingestion Pipeline.
(Images)
Step 11: Configure Profiler Ingestion
- In this step, we will configure the Profiler ingestion pipeline. Please follow the instructions below.
(Profiler Image)
Profiler Configuration:
This workflow allows you to profile your table assets and gain insights into their structure (examples of metrics computed: max, min, mean, etc.).
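As a rough illustration of the metrics mentioned above, the sketch below computes them with a single aggregate query. The table my_dataset.my_table and the column amount are hypothetical placeholders; this is not how Prakash itself runs the profiler, only an example of the kind of computation involved.

    # Sketch: the kind of aggregate metrics a profiler computes for a numeric column.
    # `my_dataset.my_table` and the `amount` column are hypothetical placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses application default credentials

    query = """
        SELECT
          COUNT(*)                AS row_count,
          MIN(amount)             AS min_value,
          MAX(amount)             AS max_value,
          AVG(amount)             AS mean_value,
          COUNTIF(amount IS NULL) AS null_count
        FROM `my_dataset.my_table`
    """
    row = next(iter(client.query(query).result()))
    print(row.row_count, row.min_value, row.max_value, row.mean_value, row.null_count)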
Database Filter Pattern:
- Database filter patterns are used to control whether to include databases as part of metadata ingestion.
- Include: Explicitly include databases by adding a list of comma-separated regular expressions to the Include field. Prakash will include all databases with names matching one or more of the supplied regular expressions. All other databases will be excluded.
- For example, to include only those databases whose name starts with the word demo, add the regex pattern ^demo.* to the Include field.
- Exclude: Explicitly exclude databases by adding a list of comma-separated regular expressions to the Exclude field. Prakash will exclude all databases with names matching one or more of the supplied regular expressions. All other databases will be included.
- For example, to exclude all databases whose name contains the word demo, add the regex pattern .*demo.* to the Exclude field.
Schema Filter Pattern:
- Schema filter patterns are used to control whether to include schemas as part of metadata ingestion.
- Include: Explicitly include schemas by adding a list of comma-separated regular expressions to the Include field. Prakash will include all schemas with names matching one or more of the supplied regular expressions. All other schemas will be excluded.
- For example, to include only those schemas whose name starts with the word demo, add the regex pattern ^demo.* to the Include field.
- Exclude: Explicitly exclude schemas by adding a list of comma-separated regular expressions to the Exclude field. Prakash will exclude all schemas with names matching one or more of the supplied regular expressions. All other schemas will be included.
- For example, to exclude all schemas whose name contains the word demo, add the regex pattern .*demo.* to the Exclude field.
Table Filter Pattern:
- Table filter patterns are used to control whether to include tables as part of metadata ingestion.
- Include: Explicitly include tables by adding a list of comma-separated regular expressions to the Include field. Prakash will include all tables with names matching one or more of the supplied regular expressions. All other tables will be excluded.
- For example, to include only those tables whose name starts with the word demo, add the regex pattern ^demo.* to the Include field.
- Exclude: Explicitly exclude tables by adding a list of comma-separated regular expressions to the Exclude field. Prakash will exclude all tables with names matching one or more of the supplied regular expressions. All other tables will be included.
- For example, to exclude all tables whose name contains the word demo, add the regex pattern .*demo.* to the Exclude field. A small sketch for testing these patterns locally follows this list.
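If you want to check how your include and exclude patterns will behave before deploying, you can try them locally with Python's re module. The sketch below mirrors the ^demo.* and .*demo.* examples above; the database names are hypothetical.

    # Sketch: trying include/exclude filter regexes locally before deploying.
    # The names below are hypothetical; the patterns mirror the examples above.
    import re

    names = ["demo_sales", "analytics_demo", "production"]

    include = [r"^demo.*"]   # include only names starting with "demo"
    exclude = [r".*demo.*"]  # exclude any name containing "demo"

    included = [n for n in names if any(re.match(p, n) for p in include)]
    excluded = [n for n in names if any(re.match(p, n) for p in exclude)]

    print("matched by include:", included)  # ['demo_sales']
    print("matched by exclude:", excluded)  # ['demo_sales', 'analytics_demo']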
Profile Sample:
- Percentage of data or number of rows to use when sampling tables.
- By default, the profiler will run against the entire table.
Profile Sample Type:
- The sample type can be set to either:
o Percentage: This will use a percentage to sample the table (e.g., if the table has 100 rows and we set the sample percentage to 50%, the profiler will use 50 random rows to compute the metrics).
o Row Count: This will use a number of rows to sample the table (e.g., if the table has 100 rows and we set the row count to 10, the profiler will use 10 random rows to compute the metrics). See the SQL sketch below for how these two modes differ.
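The difference between the two sample types can be expressed in plain BigQuery SQL, as in the sketch below. This is only an illustration of the idea, not the connector's actual implementation; my_dataset.my_table is a hypothetical table.

    # Sketch: the two sampling strategies expressed as plain BigQuery SQL.
    # `my_dataset.my_table` is a hypothetical table name.

    # Percentage: sample roughly 50% of the table (block-based sampling).
    percentage_sample = """
        SELECT * FROM `my_dataset.my_table` TABLESAMPLE SYSTEM (50 PERCENT)
    """

    # Row Count: pick 10 random rows.
    row_count_sample = """
        SELECT * FROM `my_dataset.my_table` ORDER BY RAND() LIMIT 10
    """
    print(percentage_sample, row_count_sample)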
Thread Count:
- Number of threads that will be used when computing the profiler metrics. A high number can have a negative performance effect.
- We recommend using the default value unless you have a good understanding of multi-threading, and your database is capable of handling multiple concurrent connections.
Timeout (Seconds):
- This will set the duration a profiling job against a table should wait before interrupting its execution and moving on to profiling the next table.
- It is important to note that the profiler will wait for the hanging query to terminate before killing the execution. If there is a risk for your profiling job to hang, it is important to also set a query/connection timeout on your database engine. The default value for the profiler timeout is 12 hours.
Ingest Sample Data:
- Set the Ingest Sample Data toggle to control whether to ingest sample data as part of profiler ingestion. If this is enabled, 100 rows will be ingested by default.
Enable Debug Logs:
- Set the Enable Debug Log toggle to set the logging level of the process to debug. You can check these logs in the Ingestion tab of the service and dig deeper into any errors you might find.
Auto Tag PII:
- Set the Auto Tag PII toggle to control whether to automatically tag columns that might contain sensitive information as part of profiler ingestion.
- If Ingest Sample Data is enabled, Prakash will leverage machine learning to infer which columns may contain PII-sensitive data. If disabled, Prakash will infer this information from the column names.
Then, click on NEXT to schedule the Profiler Ingestion Pipeline.
Step 12: Schedule the Profiler Ingestion and Deploy
- Scheduling can be set up at an hourly, daily, weekly, or manual cadence. The time zone is UTC. Select a Start Date to schedule the ingestion. It is optional to add an End Date.
- Review your configuration settings. If they match what you intended, click ADD & DEPLOY to create the service and schedule Profiler ingestion.
- If something doesn’t look right, click the BACK button to return to the appropriate step and change the settings as needed.
- After configuring the workflow, you can click on ADD & DEPLOY to create the pipeline.
(Profiler Image)
Step 13: View the Profiler Ingestion Pipeline
Once the workflow has been successfully deployed, you can view the Profiler Ingestion Pipeline running from the Service Page.