Create a Glue Crawler

A Glue Crawler is a feature that automatically infers database and table schemas from your source data and then stores the associated metadata in the AWS Glue Data Catalog.

  1. Go to the AWS Glue Console.

  2. In the left navigation menu, click Crawlers.

  3. On the Crawlers page, click Create crawler.

  4. Specify nyc-taxi-crawler as the crawler’s name, and then click Next.

  5. On the Choose data sources and classifiers screen, specify the following information, and then click Next.

    • Click Add a data source.
    • For Data source, choose S3.
    • For Location of S3 data, select In this account.
    • For Include S3 path, enter s3://serverlessanalytics-[your-account-id]-raw/nyc-taxi/
    • For Subsequent crawler runs, select Crawl all sub-folders.
    • Click Add an S3 data source.

  6. On the Configure security settings screen, choose ServerlessAnalyticsRole from the Existing IAM role list, and then click Next.

  7. On the Set output and scheduling screen, click Add database.

  8. Specify nyctaxi_db as the unique database name, and then click Create database.

  9. Go back to the previous tab (the Set output and scheduling screen), refresh the Target database list, and choose the newly created database nyctaxi_db.

  10. Specify raw_ in the Table name prefix - optional field.

  11. Under Crawler schedule, leave the frequency set to On demand, and then click Next.

  12. Review the crawler details, and then click Create crawler.
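
For readers who prefer scripting over the console, the steps above can be approximated with boto3. This is a minimal sketch and not part of the original workshop. It assumes that boto3 is installed, that AWS credentials are configured, and that the ServerlessAnalyticsRole role and the raw S3 bucket already exist. The helper names build_crawler_request and create_crawler, and the account ID used in the usage note, are hypothetical.

```python
def build_crawler_request(account_id: str) -> dict:
    """Build keyword arguments for glue.create_crawler that mirror the console steps.

    account_id is a placeholder for your own 12-digit AWS account ID.
    """
    return {
        "Name": "nyc-taxi-crawler",
        "Role": "ServerlessAnalyticsRole",
        "DatabaseName": "nyctaxi_db",
        "TablePrefix": "raw_",
        "Targets": {
            "S3Targets": [
                {"Path": f"s3://serverlessanalytics-{account_id}-raw/nyc-taxi/"}
            ]
        },
        # Equivalent of "Crawl all sub-folders" for subsequent crawler runs.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVERYTHING"},
        # No "Schedule" key: the crawler then runs on demand only.
    }


def create_crawler(account_id: str) -> None:
    """Create the target database and the crawler (requires AWS access)."""
    # Imported here so the request can be built and inspected without boto3.
    import boto3

    glue = boto3.client("glue")
    # Raises AlreadyExistsException if nyctaxi_db was already created.
    glue.create_database(DatabaseInput={"Name": "nyctaxi_db"})
    glue.create_crawler(**build_crawler_request(account_id))
```

Calling create_crawler("123456789012") with your own account ID in place of the example value should produce a crawler with the same settings as the one configured in the console. As in the console steps, it creates the crawler without running it.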