Package cascading.tuple

Examples of cascading.tuple.Fields
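A Fields instance names and orders the values in a Tuple, and is used both to declare what a pipe emits and to select arguments from the stream. A minimal sketch of the basics (the field names are arbitrary placeholders):

        import cascading.tuple.Fields;
        import cascading.tuple.Tuple;
        import cascading.tuple.TupleEntry;

        // Declare a two-field layout and bind a tuple of values to it.
        Fields userFields = new Fields("name", "score");
        TupleEntry entry = new TupleEntry(userFields, new Tuple("alice", 0.9));

        // Fields are immutable; append() returns a new, combined declaration.
        Fields withTime = userFields.append(new Fields("fetchTime"));

The snippets below show the same class at work in real pipe assemblies.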


        Tap inputSource = platform.makeTap(platform.makeTextScheme(), crawlDbPath);
        Pipe importPipe = new Pipe("import pipe");
        // Apply a regex to extract the relevant fields
        RegexParser crawlDbParser = new RegexParser(CrawlDbDatum.FIELDS,
                                                        "^(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*)");
        importPipe = new Each(importPipe, new Fields("line"), crawlDbParser);

        // Split into tuples that are to be fetched and that have already been fetched
        SplitterAssembly splitter = new SplitterAssembly(importPipe, new SplitFetchedUnfetchedSSCrawlDatums());

        Pipe finishedDatumsFromDb = new Pipe("finished datums from db", splitter.getRHSPipe());
        Pipe urlsToFetchPipe = splitter.getLHSPipe();

        // Limit to MAX_DISTRIBUTED_FETCH when running on a real cluster,
        // or MAX_LOCAL_FETCH when running locally. So first sort the entries
        // from high to low by link score.
        // TODO add unit test
        urlsToFetchPipe = new GroupBy(urlsToFetchPipe, new Fields(CrawlDbDatum.LINKS_SCORE_FIELD), true);
        long maxToFetch = isLocal ? MAX_LOCAL_FETCH : MAX_DISTRIBUTED_FETCH;
        urlsToFetchPipe = new Each(urlsToFetchPipe, new CreateUrlDatumFromCrawlDbDatum(maxToFetch));

        BaseScoreGenerator scorer = new LinkScoreGenerator();

        // Create the sub-assembly that runs the fetch job
        int maxThreads = isLocal ? CrawlConfig.DEFAULT_NUM_THREADS_LOCAL : CrawlConfig.DEFAULT_NUM_THREADS_CLUSTER;
        SimpleHttpFetcher fetcher = new SimpleHttpFetcher(maxThreads, fetcherPolicy, userAgent);
        fetcher.setMaxRetryCount(CrawlConfig.MAX_RETRIES);
        fetcher.setSocketTimeout(CrawlConfig.SOCKET_TIMEOUT);
        fetcher.setConnectionTimeout(CrawlConfig.CONNECTION_TIMEOUT);

        FetchPipe fetchPipe = new FetchPipe(urlsToFetchPipe, scorer, fetcher, platform.getNumReduceTasks());
        Pipe statusPipe = new Pipe("status pipe", fetchPipe.getStatusTailPipe());
        Pipe contentPipe = new Pipe("content pipe", fetchPipe.getContentTailPipe());
        contentPipe = TupleLogger.makePipe(contentPipe, true);

        // Create a parser that returns the raw HTML (cleaned up by Tika) as the parsed content.
        SimpleParser parser = new SimpleParser(new ParserPolicy(), true);
        ParsePipe parsePipe = new ParsePipe(contentPipe, parser);

        Pipe analyzerPipe = new Pipe("analyzer pipe", parsePipe.getTailPipe());
        analyzerPipe = new Each(analyzerPipe, new AnalyzeHtml());
       
        Pipe outlinksPipe = new Pipe("outlinks pipe", analyzerPipe);
        outlinksPipe = new Each(outlinksPipe, new CreateLinkDatumFromOutlinksFunction());

        Pipe resultsPipe = new Pipe("results pipe", analyzerPipe);
        resultsPipe = new Each(resultsPipe, new CreateResultsFunction());
       
        // Join the finished datums from the crawl DB with the status, analyzer
        // and outlink results, all keyed on URL.
        Pipe updatePipe = new CoGroup("update pipe", Pipe.pipes(finishedDatumsFromDb, statusPipe, analyzerPipe, outlinksPipe),
                        Fields.fields(new Fields(CrawlDbDatum.URL_FIELD), new Fields(StatusDatum.URL_FN),
                                        new Fields(AnalyzedDatum.URL_FIELD), new Fields(LinkDatum.URL_FN)), null, new OuterJoin());
        updatePipe = new Every(updatePipe, new UpdateCrawlDbBuffer(), Fields.RESULTS);

       
        // output : loop dir specific crawldb
        BasePath outCrawlDbPath = platform.makePath(curLoopDirPath, CrawlConfig.CRAWLDB_SUBDIR_NAME);
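In the assembly above, Fields plays two roles: new Fields("line") is an argument selector that picks which incoming field the RegexParser sees, while CrawlDbDatum.FIELDS declares what the parser emits; Fields.fields(...) simply builds the Fields[] of per-pipe grouping keys that CoGroup expects. A stripped-down sketch of the selector/declaration pattern, with made-up field names:

        // The parser declares the fields it emits; the Each's argument
        // selector routes only the "line" field into the parser.
        Fields parsedFields = new Fields("url", "status", "score");
        RegexParser lineParser = new RegexParser(parsedFields, "^(.*?)\\t(.*?)\\t(.*)$");
        Pipe parsed = new Each(new Pipe("import"), new Fields("line"), lineParser);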


import cascading.tuple.Fields;

public class FieldUtils {

    public static Fields add(Fields fields, String... moreFieldNames) {
        Fields moreFields = new Fields(moreFieldNames);
        return fields.append(moreFields);
    }
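A quick usage sketch for the helper above (the field names are placeholders). Note that Fields.append() fails on a duplicate name, so callers must only add names not already present:

        Fields base = new Fields("url", "status");
        Fields extended = FieldUtils.add(base, "fetchTime", "score");
        // extended now declares "url", "status", "fetchTime", "score"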

    public void setScore(double score) {
        _tupleEntry.setDouble(SCORE_FN, score);
    }

    public static Fields getSortingField() {
        return new Fields(SCORE_FN);
    }

    }

    // ==================================================
   
    public static Fields getGroupingField() {
        return new Fields(GROUPING_KEY_FN);
    }

    public static Fields getGroupingField() {
        return new Fields(GROUPING_KEY_FN);
    }

    public static Fields getSortingField() {
        return new Fields(FETCH_TIME_FN);
    }
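Accessor pairs like the ones above are typically handed straight to a GroupBy, which takes separate Fields for the grouping key and the within-group sort. A sketch of that pattern, where Datum stands in for any of these datum classes and pipe is some upstream Pipe:

        // Group by key, then sort each group by fetch time, newest first.
        Pipe grouped = new GroupBy(pipe,
                Datum.getGroupingField(),   // e.g. new Fields(GROUPING_KEY_FN)
                Datum.getSortingField(),    // e.g. new Fields(FETCH_TIME_FN)
                true);                      // reverse the sort order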

       
        return result;
    }

    public static Fields getParsedTextField() {
        return new Fields(ParsedDatum.PARSED_TEXT_FN);
    }
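Single-field accessors like this also work as argument selectors. For example, a sketch that keeps only the parsed text after the parse step, using Cascading's Identity function (which passes its arguments through unchanged, so the default Fields.RESULTS output drops everything else):

        // Keep only the parsed-text field, discarding all other fields.
        Pipe textOnly = new Each(parsePipe.getTailPipe(),
                getParsedTextField(), new Identity());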

        Pipe skippedStatus = new Pipe("skipped status", new Each(splitter.getLHSPipe(), new MakeSkippedStatus()));
       
        // TODO KKr: the group name is already being set here (so that the tail
        // pipe gets the same name), which means a separate group name can't be
        // passed in for BaseTool.nameFlowSteps to use as the job name.
        Pipe joinedStatus = new GroupBy(STATUS_PIPE_NAME, Pipe.pipes(skippedStatus, fetchedStatus), new Fields(StatusDatum.URL_FN));

        setTails(fetchedContent, joinedStatus);
    }
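Note that a GroupBy over multiple pipes, as above, merges the streams before grouping, so Cascading requires every incoming pipe to declare the same field layout; presumably both the skipped and fetched branches here emit StatusDatum tuples keyed by StatusDatum.URL_FN.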

    public void setGroupKey(String groupKey) {
        _tupleEntry.setString(GROUP_KEY_FN, groupKey);
    }
   
    public static Fields getGroupingField() {
        return new Fields(GROUP_KEY_FN);
    }

        String[] columnNames;
        int numChunks;
        Options options;

        public DBMigrateScheme(int numChunks, String dbDriver, String dbUrl, String username, String pwd,
                String tableName, String pkColumn, String[] columnNames, Options options) {
            super(new Fields(columnNames));
            this.dbDriver = dbDriver;
            this.dbUrl = dbUrl;
            this.username = username;
            this.pwd = pwd;
            this.tableName = tableName;
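One detail worth noting above: new Fields(columnNames) works because the Fields constructor takes a varargs of Comparable names, so an existing String[] can be passed directly. A tiny sketch with hypothetical column names:

        String[] columnNames = { "id", "name", "email" }; // hypothetical columns
        Fields tableFields = new Fields(columnNames);      // one field per column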

    WriteDRMsToSolr(Map<String, String> fields) throws IOException {
        Configuration conf = new JobConf();
        fs = FileSystem.get(conf);
        iDFieldName = fields.get("iD1");
        dRM1FieldName = fields.get("dRM1FieldName");
        inFieldsDRM1 = new Fields(iDFieldName, dRM1FieldName);
        simpleOutFields = new Fields(iDFieldName, dRM1FieldName);
        if (fields.containsKey("dRM2FieldName")) { // joining two DRMs, so define the fields needed for the join
            iD2FieldName = iDFieldName + "2"; // make it unique relative to the other ID field name
            dRM2FieldName = fields.get("dRM2FieldName");
            inFieldsDRM2 = new Fields(iDFieldName, dRM2FieldName);
            common = new Fields(iDFieldName);
            grouped = new Fields(iDFieldName, dRM1FieldName, iD2FieldName, dRM2FieldName);
            joinedOutFields = new Fields(iDFieldName, dRM1FieldName, dRM2FieldName);
        }
    }
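The Fields built in this constructor line up the way a CoGroup would consume them; a sketch under that assumption, with hypothetical pipe names:

        // Join the two DRM pipes on their shared ID field. The declared
        // "grouped" layout renames the right side's ID to avoid a collision.
        Pipe joined = new CoGroup(drm1Pipe, common, drm2Pipe, common, grouped);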
