Package: com.cloudera.cdk.data

Usage examples of the com.cloudera.cdk.data.DatasetRepository class


    if (avroSchemaFile == null && avroSchemaReflectClass == null) {
      throw new IllegalArgumentException("One of cdk.avroSchemaFile or " +
          "cdk.avroSchemaReflectClass must be specified");
    }

    DatasetRepository repo = getDatasetRepository();

    DatasetDescriptor.Builder descriptorBuilder = new DatasetDescriptor.Builder();
    configureSchema(descriptorBuilder, avroSchemaFile, avroSchemaReflectClass);

    if (format.equals(Formats.AVRO.getName())) {
      descriptorBuilder.format(Formats.AVRO);
    } else if (format.equals(Formats.PARQUET.getName())) {
      descriptorBuilder.format(Formats.PARQUET);
    } else {
      throw new MojoExecutionException("Unrecognized format: " + format);
    }

    if (partitionExpression != null) {
      descriptorBuilder.partitionStrategy(Accessor.getDefault().fromExpression(partitionExpression));
    }

    repo.create(datasetName, descriptorBuilder.build());
  }
View Full Code Here


  private String datasetName;

  @Override
  public void execute() throws MojoExecutionException, MojoFailureException {
    logger.warn("CDK drop-dataset is deprecated -- please use delete-dataset");
    DatasetRepository repo = getDatasetRepository();
    repo.delete(datasetName);
  }
View Full Code Here

    }
    return conf;
  }

  DatasetRepository getDatasetRepository() {
    DatasetRepository repo;
    if (repositoryUri != null) {
      return DatasetRepositories.open(repositoryUri);
    }
    if (!hcatalog && rootDirectory == null) {
      throw new IllegalArgumentException("Root directory must be specified if not " +
View Full Code Here

    int exitCode = tool.run(input, datasetUri, datasetName);

    Assert.assertEquals(0, exitCode);

    DatasetRepository repo = DatasetRepositories.open(datasetUri);
    Dataset<GenericRecord> dataset = repo.load(datasetName);
    DatasetReader<GenericRecord> reader = dataset.newReader();
    try {
      reader.open();
      Assert.assertTrue(reader.hasNext());
      GenericRecord first = reader.next();
View Full Code Here

  @Override
  public int run(String[] args) throws Exception {

    // Construct an HCatalog dataset repository using managed Hive tables
    DatasetRepository repo = DatasetRepositories.open("repo:hive");

    // Create a dataset of users with the Avro schema in the repository
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schemaUri("resource:user.avsc")
        .build();
    Dataset<GenericRecord> users = repo.create("users", descriptor);

    // Get a writer for the dataset and write some users to it
    DatasetWriter<GenericRecord> writer = users.newWriter();
    try {
      writer.open();
View Full Code Here

  @Override
  public int run(String[] args) throws Exception {

    // Construct a filesystem dataset repository rooted at /tmp/data
    DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/tmp/data");

    // Load the users dataset
    Dataset<GenericRecord> users = repo.load("users");

    // Get a reader for the dataset and read all the users
    DatasetReader<GenericRecord> reader = users.newReader();
    try {
      reader.open();
View Full Code Here

  @Override
  public int run(String[] args) throws Exception {

    // Construct a filesystem dataset repository rooted at /tmp/data
    DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/tmp/data");

    // Create a dataset of users with the Avro schema in the repository
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schemaUri("resource:user.avsc")
        .build();
    Dataset<GenericRecord> users = repo.create("users", descriptor);

    // Get a writer for the dataset and write some users to it
    DatasetWriter<GenericRecord> writer = users.newWriter();
    try {
      writer.open();
View Full Code Here

  @Override
  public int run(String[] args) throws Exception {

    // Construct a filesystem dataset repository rooted at /tmp/data
    DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/tmp/data");

    // Load the users dataset
    Dataset<GenericRecord> users = repo.load("users");

    // Get the partition strategy and use it to construct a partition key for
    // hash(username)=0
    PartitionStrategy partitionStrategy = users.getDescriptor().getPartitionStrategy();
    PartitionKey partitionKey = partitionStrategy.partitionKey(0);
View Full Code Here

  @Override
  public int run(String[] args) throws Exception {

    // Construct a filesystem dataset repository rooted at /tmp/data
    DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/tmp/data");

    // Create a partition strategy that hash partitions on username with 10 buckets
    PartitionStrategy partitionStrategy =
        new PartitionStrategy.Builder().hash("username", 10).build();

    // Create a dataset of users with the Avro schema in the repository
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schemaUri("resource:user.avsc")
        .partitionStrategy(partitionStrategy)
        .build();
    Dataset<GenericRecord> users = repo.create("users", descriptor);

    // Get a writer for the dataset and write some users to it
    DatasetWriter<GenericRecord> writer = users.newWriter();
    try {
      writer.open();
View Full Code Here

  @Override
  public int run(String[] args) throws Exception {

    // Construct an HCatalog dataset repository using managed Hive tables
    DatasetRepository repo = DatasetRepositories.open("repo:hive");

    // Load the users dataset
    Dataset<GenericRecord> users = repo.load("users");

    // Get a reader for the dataset and read all the users
    DatasetReader<GenericRecord> reader = users.newReader();
    try {
      reader.open();
View Full Code Here

TOP

Related Classes of com.cloudera.cdk.data.DatasetRepository

Copyright © 2018 www.massapi.com. All rights reserved.
All source code are property of their respective owners. Java is a trademark of Sun Microsystems, Inc and owned by ORACLE Inc. Contact coftware#gmail.com.